Exploration and Practice of Flink CDC & MongoDB Joint Real-time Data Warehouse

Abstract: This article is compiled from a talk by Sun Jiabao, technical expert at XTransfer and Flink CDC Maintainer, delivered in the data integration session of Flink Forward Asia 2022. The content is divided into four parts:

  1. Exploration of MongoDB in real-time data warehouse

  2. The implementation principle and usage practice of MongoDB CDC Connector

  3. Feature preview of the FLIP-262 MongoDB Connector

  4. Summary and Outlook

1. Exploration of MongoDB in real-time data warehouse

MongoDB is a non-relational document database that supports large-scale data storage with a flexible storage structure, and it is widely used within XTransfer.

XTransfer is also actively exploring real-time data warehouses. Besides the currently popular approach of building real-time data warehouses on lake technology, Flink and MongoDB also have the potential to build a lightweight real-time data warehouse.

1.1 Introduction to MongoDB

MongoDB is a document-oriented non-relational database that supports semi-structured data storage. It is also a distributed database, providing two cluster deployment modes, replica sets and sharded clusters, with high availability and horizontal scalability, making it suitable for large-scale data storage. In addition, MongoDB introduced the Change Streams feature in version 3.6, which supports and simplifies subscribing to database changes.

1.2 Common real-time architecture selection

  • A real-time data warehouse built on a pure real-time link with Flink and Kafka.

The advantages include high data freshness, fast data writing, and a rich ecosystem around Kafka's peripheral components.

The defects include that intermediate results cannot be queried: Kafka is a linear storage that records every change of the data, so obtaining the latest mirror value of a record requires traversing all the records in Kafka, which makes flexible and fast OLAP queries impossible and troubleshooting difficult. Kafka's hot-cold storage separation has yet to be realized, so cheap storage cannot be fully utilized. In addition, this architecture generally requires maintaining two separate streaming and batch pipelines, which increases deployment, development, and operations costs.

  • Real-time data warehouse architecture based on lake storage.

Currently popular data lakes such as Iceberg and Hudi support both batch and streaming reads, and can achieve unified stream-batch computing with Flink. In addition, lake storage is designed to make full use of cheap storage, so its storage cost is lower than Kafka's.

However, a real-time data warehouse based on lake storage also has some disadvantages, including higher deployment costs, such as the need to additionally deploy OLAP query engines, and the need for extra components to support data permissions.

  • Real-time data warehouse based on MongoDB.

MongoDB itself supports large-scale data storage and flexible data formats; it has low deployment costs, fewer component dependencies, and complete permission control. Compared with other real-time data warehouse architectures, Flink and MongoDB also have the potential to build a lightweight real-time data warehouse. This mode requires Flink to have streaming read, batch read, dimension table lookup, and row-level write capabilities for MongoDB.

At present, unified full and incremental streaming reads are provided by the Flink CDC MongoDB Connector, while batch reads, dimension table lookups, and writes are provided by the FLIP-262 MongoDB Connector.

2. Implementation principle and usage practice of MongoDB CDC Connector

2.1 MongoDB CDC Connector

The MongoDB CDC Connector was developed by the XTransfer infrastructure team and contributed to the Flink CDC community. It was officially introduced in Flink CDC 2.1.0, supporting unified full and incremental CDC reading and metadata extraction; version 2.1.1 added support for connecting to MongoDB instances without authentication enabled; version 2.2.0 added regular-expression filtering of databases and collections; and version 2.3.0 implemented parallel incremental snapshot reading based on the incremental snapshot framework.
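
As a concrete reference, below is a minimal Flink SQL sketch of how such a CDC source table is typically declared. The host, credentials, and the orders schema are placeholders rather than examples from the talk; the option names follow the Flink CDC MongoDB connector documentation for these versions.

```sql
-- Minimal sketch of a MongoDB CDC source table in Flink SQL.
-- Host, credentials, and the orders collection are placeholders.
CREATE TABLE orders (
  _id STRING,
  customer_id STRING,
  amount DECIMAL(10, 2),
  order_time TIMESTAMP_LTZ(3),
  PRIMARY KEY (_id) NOT ENFORCED
) WITH (
  'connector' = 'mongodb-cdc',
  'hosts' = 'localhost:27017',
  'username' = 'flink_user',
  'password' = 'flink_pw',
  'database' = 'mydb',
  'collection' = 'orders',
  -- parallel incremental snapshot reading, available since Flink CDC 2.3
  'scan.incremental.snapshot.enabled' = 'true'
);
```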

2.2 Change Streams Features

The MongoDB CDC Connector is implemented on top of the MongoDB Change Streams feature. MongoDB is a distributed database; in a distributed environment, cluster members replicate data from one another to ensure data integrity and fault tolerance. Similar to MySQL's binlog, MongoDB provides the oplog to record data change operations, and secondary nodes synchronize data by reading the change log of the primary node.

We can also obtain database changes by directly traversing the MongoDB oplog. However, a sharded cluster generally consists of multiple shards, and each shard is generally an independent replica set. The oplog on a shard only contains the data within its shard range, so we need to traverse the oplog on all shards and sort and merge them according to time, which is obviously more complicated.

It is worth mentioning that MongoDB introduced the Change Streams feature in version 3.6, providing a simpler API that simplifies data subscription.

Using the Change Streams API, we are shielded from the complexity of traversing and merging oplogs; it supports subscriptions at multiple levels, such as instance, database, and collection, as well as a complete failure recovery mechanism.

2.3 Failure recovery of Change Streams

MongoDB uses the ResumeToken for breakpoint recovery. Each record returned by Change Streams carries a ResumeToken, and each ResumeToken corresponds to a specific oplog entry, indicating the oplog position that has been read; the change time and the primary key of the changed document are also recorded. By passing the ResumeToken as the starting parameter to methods such as resumeAfter and startAfter, an interrupted change stream can be resumed.

The ResumeToken of Change Streams is a string encoded with MongoDB's KeyString format; its structure is shown on the left side of the figure above. ts represents the time of the data change, ui represents the UUID of the changed collection, and o2 represents the primary key of the changed document. For a detailed description of the oplog fields, please refer to oplog_entry.

The right side of the figure above shows a specific oplog record, describing a change to the record whose primary key ends in 107, setting its weight field to 5.4. It is worth mentioning that before version 6.0, MongoDB did not provide complete mirror values of the document before and after a change. This is one of the reasons why we did not implement the MongoDB CDC Connector directly on top of the oplog.

2.4 Evolution of Change Streams

MongoDB officially introduced the Change Streams feature in version 3.6, but it only supported subscribing to a single collection. Version 4.0 added instance- and database-level subscriptions, as well as the ability to start a change stream from a specified timestamp. The postBatchResumeToken was introduced in version 4.0.7:

Before version 4.0, after opening a change stream, the latest ResumeToken cannot be obtained if no new change data is generated. If a failure occurs at this point and an older ResumeToken is used to recover, server performance may degrade, because the server may need to scan more oplog entries; and if the oplog entry corresponding to the ResumeToken has already been purged, the change stream cannot be recovered at all.

To solve this problem, MongoDB 4.0.7 introduced the postBatchResumeToken, which marks the oplog position that has been scanned and keeps advancing over time. Using this feature, we can locate the current consumption position of Change Streams more accurately, which in turn makes the incremental snapshot read feature possible.

MongoDB 4.2 added startAfter to handle invalidate events. MongoDB 5.1 made a series of optimizations to Change Streams. MongoDB 6.0 provides complete pre- and post-image values for Change Streams, as well as a subscription mechanism for schema changes.

2.5 MongoDB CDC Connector

The implementation principle of the MongoDB CDC Connector is to use Change Streams to convert insert, update, and delete events into a Flink upsert changelog stream. In the Flink SQL scenario, the planner adds a Changelog Normalize operator to normalize the upsert changelog. Combined with Flink's powerful computing capabilities, it is easy to implement real-time ETL and even computations across heterogeneous data sources.
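
To make the real-time ETL scenario concrete, here is an illustrative continuous query over the hypothetical orders CDC table from the sketch above; the aggregation itself is an assumption, not taken from the talk.

```sql
-- Continuous aggregation over the CDC-backed orders table: the upsert
-- changelog keeps the per-customer totals up to date as documents are
-- inserted, updated, or deleted in MongoDB.
SELECT
  customer_id,
  COUNT(*)    AS order_cnt,
  SUM(amount) AS total_amount
FROM orders
GROUP BY customer_id;
```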

Flink CDC 2.3, relying on the incremental snapshot reading framework, implements lock-free snapshot reading and supports concurrent snapshots, which greatly shortens the snapshot time. The overall process of incremental snapshot reading is shown in the figure above. To parallelize the snapshot, the complete dataset must first be split into chunks, and these chunks are then assigned to different Source Readers to be read in parallel, speeding up the whole snapshot. However, MongoDB primary keys are mostly ObjectIds, which cannot be split simply by stepping through a numeric range, so a dedicated splitting strategy had to be designed for MongoDB.

MongoDB has the following three splitting strategies, which draw on the Mongo Spark project.

  • The first splitting strategy uses the sample command to randomly sample the collection, calculates the number of buckets based on the average document size, and then distributes the sampled data into the buckets to form chunk boundaries. The advantage is that it is fast and well suited to unsharded collections with large data volumes. The disadvantage is that, since it relies on sampling estimation, the chunk sizes cannot be perfectly uniform.

  • The second splitting strategy uses the splitVector command. splitVector is an internal command that MongoDB sharding uses to calculate split points; it computes chunk boundaries by scanning the specified index. The advantage is that it is fast and produces uniform chunk sizes, but it additionally requires permission to execute the splitVector command.

  • The third splitting strategy is for sharded collections. For collections that are already sharded, there is no need to recompute the split; the shard results maintained by MongoDB can be read directly as the chunk boundaries. The advantage is that it is fast and chunk sizes are uniform. However, the chunk size cannot be adjusted and depends on MongoDB's own configuration for each shard, with a default size of 64 MB. In addition, it requires read permission on the config database.

Next, we introduce the process of incremental snapshot reading. For each chunk, the current Change Streams position is recorded before and after the snapshot is executed. After the snapshot ends, the change stream is replayed between those start and end positions, and the snapshot records and change records are finally merged by key to obtain a complete result while avoiding duplicate data.

In the incremental read phase, for a single chunk we read both the snapshot data and the incremental changes within the chunk's range and merge them. However, the overall snapshot process may not be finished yet, so chunks whose snapshots have completed may still change later, and these changes need to be compensated for. We start a change stream from the globally lowest high watermark and compensate with the changes whose change time is later than each chunk's high watermark. When the global snapshot's highest point is reached, the compensation can end.

Next, we introduce some production recommendations for the MongoDB CDC Connector.

  • First, to use the incremental snapshot feature, the minimum required MongoDB version is 4.0, because before version 4.0 the ResumeToken cannot be obtained when there are no changes and a change stream cannot be started from a specified point in time, which makes the incremental snapshot feature difficult to implement.

    MongoDB 4.0.7 introduced the postBatchResumeToken, which makes it easier to obtain the current Change Streams position, so version 4.0.7 or above is recommended.

  • Second, keep document sizes under 8 MB, because MongoDB limits a single document to 16 MB. Since a change event document carries additional information, such as which fields were modified, even if the original document does not exceed 16 MB, the change event document may exceed the 16 MB limit, causing Change Streams to terminate abnormally. This can be considered a defect of MongoDB Change Streams.

    Allowing change event documents to exceed the 16 MB limit has already been raised as an issue with MongoDB.

  • Third, MongoDB actually allows the shard key to be modified within a transaction. However, modifying the shard key may cause frequent chunk migrations, incurring additional performance overhead. In addition, modifying the shard key may also cause the Update Lookup feature to fail, leading to inconsistent results in CDC scenarios.

3. Feature preview of the FLIP-262 MongoDB Connector

We introduced the MongoDB CDC Connector above, which can perform full and incremental CDC reads on MongoDB. But to build a real-time data warehouse on MongoDB, we also need the ability to batch read, write, and look up MongoDB. These capabilities are implemented in the FLIP-262 MongoDB Connector, whose first version has been released.

3.1 FLIP-262: Introduce MongoDB Connector

For parallel reading, the MongoDB Connector is implemented on the new FLIP-27 Source API; it supports batch reading and lookup. For parallel writing, it is implemented on the FLIP-177 Sink API and supports upsert writing. For the Table API, it implements the FLIP-95 interfaces, so it can be read and written with Flink SQL.

3.2 Read MongoDB

First, we insert some test data into MongoDB and then use Flink SQL to define a users table. Through a SELECT statement we get the results shown on the right, which are consistent with the test data we inserted.
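
The talk shows the table definitions on slides, so the following is a hedged sketch of what the users table and query might look like; the URI and column set are assumptions, while the option names ('connector' = 'mongodb', 'uri', 'database', 'collection') follow the FLIP-262 MongoDB connector.

```sql
-- Hypothetical users table backed by the FLIP-262 MongoDB connector.
CREATE TABLE users (
  _id         STRING,
  user_name   STRING,
  user_region STRING,
  PRIMARY KEY (_id) NOT ENFORCED
) WITH (
  'connector'  = 'mongodb',
  'uri'        = 'mongodb://localhost:27017',
  'database'   = 'mydb',
  'collection' = 'users'
);

-- Batch read of the collection; the result should match the inserted test data.
SELECT * FROM users;
```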

3.3 Write to MongoDB

First, we define a result table for the users snapshot, which corresponds to a users snapshot collection in MongoDB. Then we read the data from the users table defined above and write it into MongoDB through a Flink SQL INSERT statement.

Finally, we query the newly defined result table, and its result is shown on the right. The result is consistent with that of the source table, which means we have successfully written to the new collection.
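
Below is a hedged sketch of the write path under the same assumed schema and connection options; the users_snapshot name stands in for the snapshot collection used in the demo.

```sql
-- Hypothetical result table mapped to a users_snapshot collection.
CREATE TABLE users_snapshot (
  _id         STRING,
  user_name   STRING,
  user_region STRING,
  PRIMARY KEY (_id) NOT ENFORCED
) WITH (
  'connector'  = 'mongodb',
  'uri'        = 'mongodb://localhost:27017',
  'database'   = 'mydb',
  'collection' = 'users_snapshot'
);

-- Upsert-write the source table into MongoDB, then verify the result.
INSERT INTO users_snapshot SELECT * FROM users;
SELECT * FROM users_snapshot;
```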

3.4 Used as a dimension table association

Next, let's demonstrate the scenario of using the user table defined above as a dimension table for Lookup.

First, we define a pageviews fact table whose user_id serves as the lookup key, corresponding to the primary key of the users table defined earlier. Querying the pageviews table gives the result shown on the right.
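
The DDL is not shown in the text, so the following is an assumed sketch: the pageviews stream is read through the CDC connector and given a processing-time attribute, which a Flink SQL lookup join requires; all names and columns are placeholders.

```sql
-- Hypothetical pageviews fact table read as a CDC stream, with a
-- processing-time attribute for the lookup join below.
CREATE TABLE pageviews (
  _id       STRING,
  user_id   STRING,
  page_id   STRING,
  proc_time AS PROCTIME(),
  PRIMARY KEY (_id) NOT ENFORCED
) WITH (
  'connector'  = 'mongodb-cdc',
  'hosts'      = 'localhost:27017',
  'username'   = 'flink_user',
  'password'   = 'flink_pw',
  'database'   = 'mydb',
  'collection' = 'pageviews'
);
```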

Then we define a result table to hold the widened result. The pageviews fact table is joined with the users dimension table through Flink SQL to supplement the region information, and the joined result is written into the result table. Querying the result table then yields the widened user_region information. As shown in the figure on the right, the widened user_region appears in the last column, which means the lookup succeeded.
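
Here is a hedged sketch of the dimension table join and the widened result table; the table and column names carry over from the earlier sketches and are assumptions rather than the exact demo definitions.

```sql
-- Hypothetical widened result table.
CREATE TABLE pageviews_enriched (
  _id         STRING,
  user_id     STRING,
  page_id     STRING,
  user_region STRING,
  PRIMARY KEY (_id) NOT ENFORCED
) WITH (
  'connector'  = 'mongodb',
  'uri'        = 'mongodb://localhost:27017',
  'database'   = 'mydb',
  'collection' = 'pageviews_enriched'
);

-- Lookup join: each pageview is enriched with user_region from the users
-- dimension table as of processing time.
INSERT INTO pageviews_enriched
SELECT p._id, p.user_id, p.page_id, u.user_region
FROM pageviews AS p
LEFT JOIN users FOR SYSTEM_TIME AS OF p.proc_time AS u
  ON p.user_id = u._id;
```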

4. Summary and Outlook

4.1 Summary

With this, a real-time data warehouse architecture based on Flink and MongoDB can be realized, providing an additional option when building real-time data warehouses. As shown in the figure, the CDC Connector completes the streaming link, assisted by Lookup for data widening; the Source Connector completes the batch link; and finally the Sink Connector stores the intermediate results of the computation. Together they make up the complete real-time data warehouse architecture.

4.2 Existing problems

There are still the following problems:

  • Changelog Normalize is a stateful operator, which requires some additional state overhead.

  • Extracting the full document via Update Lookup also incurs some query overhead.

  • MongoDB limits a single document to 16 MB; if individual records are very large, this architecture may not be suitable.

4.3 Future Planning

On the MongoDB CDC Connector side, we plan to:

  • Support MongoDB 6.0.

  • Support starting a change stream from a specified point in time.

  • Push forward the optimization of Changelog Normalize.

On the MongoDB Connector side, we plan to:

  • Support predicate pushdown.

  • Support AsyncLookup.

  • Support AsyncSink.
