Apache Pulsar Technology Series - Massive DB data collection and sorting based on Pulsar

Introduction

Apache Pulsar is a high-performance, multi-tenant messaging solution for inter-service communication, offering low latency, read-write separation, cross-region replication, rapid scaling, flexible fault tolerance, and other features. As part of the Pulsar technology series, this article introduces how Pulsar is applied to collecting and sorting massive amounts of incremental DB Binlog data.

Preface

As a typical representative of next-generation messaging middleware, Pulsar is widely used in big data, advertising, billing, and other scenarios. This article shares a big data use case of Pulsar, the collection and sorting of incremental DB Binlog data, along with experience in using and tuning the Pulsar Java SDK, for your reference.

1. Background introduction

The MySQL Binlog incremental data collection and sorting scenario shared in this article is a sub-capability of the [Apache InLong](https://inlong.apache.org/) system. It relies on Apache InLong components such as DBAgent (Binlog collection), Sort (sorting and warehousing), and US (the scheduling system).

Figure 1 InLong DBAgent data collection and processing process

As shown in Figure 1, the InLong DBAgent (Binlog collection) component is implemented in Java. It synchronizes Binlog, parses, filters, and converts the Binlog data, and sends the data that meets the filtering conditions, together with the indicators, to the Pulsar cluster.

InLong Sort (sorting and warehousing) is implemented in Java. It subscribes to data from the Pulsar cluster, parses and converts the data, and finally writes it into storage (Thive).

US Runner (scheduling task) is implemented in Java. It runs on the US scheduling platform and is triggered by Pulsar messages. Before the Runner tasks mounted by the business side are pulled up, it completes the verification needed to ensure data integrity: checking the collection status of pre-dependencies, performing indicator reconciliation, performing end-to-end reconciliation, supplementing data end to end, and so on.

1.1. Functional architecture

Figure 2 Overview of DB data collection and sorting process

As shown in Figure 2, in the Apache InLong system, the incremental data collection and sorting process based on MySQL Binlog mainly consists of the following parts:

  1. InLong Manager: responsible for onboarding and delivering the DB collection and sorting configuration.

  2. InLong DBAgent: responsible for executing the actual DB collection tasks. The nodes are stateless and highly available, support deployment on heterogeneous machine models, support HA scheduling of DB collection tasks across multiple InLong DBAgents, and send data and indicators to the corresponding Pulsar cluster.

  3. Pulsar: split into a data cluster and an indicator cluster; the two can be configured with the same cluster address if desired.

  4. InLong Sort: responsible for subscribing to the data to be sorted and handling the conversion and warehousing logic. It supports Exactly-Once semantics and multiple sinks, such as Thive/Hive, Iceberg, HBase, ClickHouse, etc.

  5. US Runner: US is a scheduling platform, and Runner refers to the tasks running on it. It currently supports indicator reconciliation and end-to-end reconciliation. Downstream tasks run only after reconciliation passes, ensuring that the data reaches users with a certain level of quality assurance.

1.2. Collection terminal based on Pulsar

1.2.1 Collection end architecture design

InLong DBAgent serves as the data collection end and sends the collected data to the Pulsar cluster.

InLong DBAgent is a stateless node that supports resumable collection (breakpoint resume), collection of multiple DB tasks on a single machine, and HA scheduling of DB collection tasks. It also supports multiple deployments on one machine and deployment on heterogeneous machine models.

Figure 3 DBAgent architecture design

As shown in Figure 3, the Job metadata information synchronized by InLong DBAgent is managed through InLong Manager, and users configure Job metadata through InLong Manager. Multiple InLong DBAgent execution nodes form an InLong DBAgent cluster.

Each InLong DBAgent cluster elects a leader through ZooKeeper, producing a node with the Coordinator role that is responsible for distributing DB collection jobs within the cluster.
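
The article does not show the election logic itself, so here is a minimal sketch of how a Coordinator could be elected through ZooKeeper using Apache Curator's LeaderLatch. Curator, the ZK address, and the paths are illustrative assumptions, not InLong DBAgent's actual implementation.

import java.util.concurrent.TimeUnit;

import org.apache.curator.framework.CuratorFramework;
import org.apache.curator.framework.CuratorFrameworkFactory;
import org.apache.curator.framework.recipes.leader.LeaderLatch;
import org.apache.curator.retry.ExponentialBackoffRetry;

public class CoordinatorElection {
    public static void main(String[] args) throws Exception {
        // Hypothetical ZK address and election path.
        CuratorFramework zk = CuratorFrameworkFactory.newClient(
                "zk-host:2181", new ExponentialBackoffRetry(1000, 3));
        zk.start();

        LeaderLatch latch = new LeaderLatch(zk, "/inlong-dbagent/coordinator", "agent-node-1");
        latch.start();

        // Block until this node wins the election (or give up after a timeout).
        if (latch.await(30, TimeUnit.SECONDS)) {
            // This node now holds the Coordinator role and can distribute DB collection jobs.
            System.out.println("Elected as Coordinator, start dispatching collection jobs");
        }
    }
}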

1.2.2 Production data and indicators

Figure 4 Data/indicator flow of a single Job inside InLong DBAgent and the time consumed by each part

InLong DBAgent collects data for multiple Jobs at the same time. Figure 4 shows the processing flow of a single Job inside InLong DBAgent. Different Jobs are logically isolated (for a long time, historical versions did not achieve complete isolation; later chapters will cover some of the resulting problems). That is, each Job uses completely independent logical resources, such as the DB connection, the data Pulsar Client and Producer, the indicator Pulsar Client and Producer, the caches used for aggregation during intermediate data flow, and the dispatch threads and queues. This avoids interference between Jobs and also makes it easier to schedule Jobs across InLong DBAgent nodes for HA.

Of course, there are certain risks in this design method, which requires reasonable planning during deployment and operation. Detailed explanations will be given in the following chapters.

To ensure data completeness, the entire collection and sorting process supports indicator reconciliation. For each time partition, the reconciliation compares the number of records that InLong DBAgent successfully collected and sent to Pulsar with the total number of records that InLong Sort wrote into Thive after deduplication.

InLong DBAgent ensures data integrity and indicator accuracy through two design points.

The first is the Binlog position acknowledgment mechanism, which ensures the continuity of the pulling process and prevents the collection position from skipping ahead.

Each Job in InLong DBAgent pulls data, parses it, and dispatches it downstream. As part of this flow, the corresponding position information is added to a ConcurrentSkipListSet (positions with no actual data to dispatch, such as positions that must be skipped or logical positions generated by heartbeats, also need to be added and later acked, because removing them updates the current minimum position). After a message is successfully sent to Pulsar, the internal position-ack process runs: the position is removed from the ConcurrentSkipListSet, and the smallest position remaining in the set is, via comparison logic, written into the collection-position cache as the position up to which collection is complete. A background thread periodically syncs this cached position to ZK and reports it to InLong Manager.
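
As a minimal sketch of this acknowledgment mechanism (the class and field names are illustrative, not InLong's actual code): positions are registered before sending, removed in the send-success callback, and only the smallest outstanding position advances the persisted progress.

import java.util.concurrent.ConcurrentSkipListSet;
import java.util.concurrent.atomic.AtomicLong;

public class PositionTracker {
    // Outstanding Binlog positions, ordered so the smallest (oldest) one can be found cheaply.
    private final ConcurrentSkipListSet<Long> pending = new ConcurrentSkipListSet<>();
    // Last confirmed position; a background thread periodically flushes this to ZK / InLong Manager.
    private final AtomicLong committed = new AtomicLong(0L);

    // Called when a position enters the send pipeline, including logical positions
    // such as skipped events or heartbeat-generated positions.
    public void register(long position) {
        pending.add(position);
    }

    // Called from the Pulsar send-success callback.
    public void ack(long position) {
        pending.remove(position);
        // Smallest outstanding position, or the acked position itself when nothing is pending.
        Long lowWatermark = pending.ceiling(Long.MIN_VALUE);
        final long candidate = (lowWatermark == null) ? position : lowWatermark;
        committed.accumulateAndGet(candidate, Math::max);   // comparison logic: only move forward
    }

    // Read by the periodic thread that syncs progress to ZK and reports it to InLong Manager.
    public long committedPosition() {
        return committed.get();
    }
}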

When the InLong DBAgent process restarts or a Job is scheduled onto a new InLong DBAgent node, the Job is first initialized with the position saved in ZK, so that pulling continues from the position where the last collection left off.

Note that positions are updated and saved asynchronously, so after a restart or HA scheduling, resuming a Job may produce a small amount of duplicate data.

The second is a one-to-one guarantee between indicators and data. Indicator data is generated only in the success branch of the callback that runs after a message is asynchronously sent to Pulsar, and it is aggregated and periodically sent to the indicator cluster.
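
The sketch below illustrates this binding between data and indicators, reusing the hypothetical PositionTracker from the earlier sketch; the counter structure and key format are assumptions. The indicator is counted only in the success branch of the asynchronous send callback, so a message that Pulsar never acknowledged never contributes to reconciliation.

import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.LongAdder;

import org.apache.pulsar.client.api.Producer;

public class MetricAwareSender {
    private final Producer<byte[]> producer;
    // Aggregation cache: partition key (e.g. "topic|yyyyMMddHH") -> successfully sent count,
    // flushed to the indicator Pulsar topic by a periodic aggregation thread.
    private final ConcurrentHashMap<String, LongAdder> successCounters = new ConcurrentHashMap<>();

    public MetricAwareSender(Producer<byte[]> producer) {
        this.producer = producer;
    }

    public void send(byte[] payload, String partitionKey, PositionTracker tracker, long position) {
        producer.sendAsync(payload).whenComplete((msgId, error) -> {
            if (error == null) {
                // Success path only: count the indicator and ack the Binlog position.
                successCounters.computeIfAbsent(partitionKey, k -> new LongAdder()).increment();
                tracker.ack(position);
            } else {
                // Failure path: retry or alarm; no indicator is produced for this message.
            }
        });
    }
}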

The process-stop and Job-stop flows of InLong DBAgent are fairly involved and form a closed loop: before the application or a Job stops, all messages sent to Pulsar must be confirmed, the corresponding reconciliation indicators must be produced, and the latest position must be written to ZK. Under abnormal operations such as kill -9, duplicate data will be generated and indicators will be lost; in that case, reconciliation for the affected partition requires manual intervention.

The production environment is complex, and business usage and operations scenarios vary widely, so the position acknowledgment mechanism cannot completely prevent position skips or data loss. For example, when the currently connected DB fails, collection switches to a new DB node; the Binlog on that node may be incomplete, or the Binlog file containing the current collection position may already have been purged. Or collection may fall behind because of a large data volume or a resource bottleneck on the collection machine, so that collection progress cannot keep up with the server-side Binlog cleanup. All of these have occurred in practice, and they must be detected promptly through monitoring indicators and handled with timely manual intervention.

1.3. Sorting end based on Pulsar

1.3.1 Sorting end architecture design

As the data sorting end, InLong Sort is responsible for subscribing to data from the Pulsar cluster, then deserializing, converting, and warehousing it.

InLong Sort is implemented on top of the Flink framework, and its implementation involves many Flink mechanisms and concepts that this article will not go into; interested readers can consult the Flink community's official documentation.

The overall architecture of InLong Sort is shown in Figure 5. The collected data is currently mainly sorted and stored in Thive.

Figure 5 InLong Sort overall architecture

1.3.2 Consumption data

InLong Sort subscribes to and consumes data from the Pulsar cluster. Its data processing flow is roughly divided into the four parts shown in Figure 6 (indicator-related operators are not drawn, and the details differ slightly between storage types).

Figure 6 InLong Sort data processing flow

InLong Sort is a single-task (Oceanus task), multi-Dataflow sorting application, so each operator must handle multiple Dataflows, whose processing flows are logically isolated from one another.

The Source operator parses and loads the source info part of a Dataflow, and handles the subscription of Pulsar messages and their distribution downstream.

The Deserialization operator parses the MQ message data, splits it into fields according to the configuration, organizes the result into a Record, and distributes it downstream.

The Sink operator handles the warehousing logic of data.

The Committer operator handles the commit logic of the warehoused data. Taking Thive as an example, it creates partitions, produces the US Pulsar messages, and so on. Not every storage type requires a Committer operator; the program differentiates based on the type of storage being written.

The overall processing flow and design of InLong Sort are fairly clear, but the implementation is relatively complex, and the operators keep evolving. This article does not go into further detail; interested readers can follow related talks or later articles on these topics.

1.4. Reconciliation based on scheduling platform

Runner is the concept of an executed instance in the US scheduling system. After InLong Sort finishes sorting the data, a Pulsar message triggers the US platform to execute the corresponding Runner. Two types of tasks are involved: 'trigger' and 'reconciliation'. The 'trigger' task is an empty task; when US's Pulsar consumer receives the corresponding MQ message, it launches the 'reconciliation' task indirectly through the 'trigger' task.

2. Pulsar application

Throughout the collection and sorting process, Pulsar acts as the transfer hub for data and indicators: it receives the data reported by InLong DBAgent and the indicators for successfully sent data, serves the InLong Sort tasks that subscribe to the data, and serves DBAgent-Audit, which subscribes to the indicator data. The next two sections describe, respectively, the usage scenarios, problems encountered, and lessons learned for producing Pulsar messages on the collection side and consuming Pulsar data on the sorting side.

2.1 Pulsar production

2.1.1 Production scenario

From the architectural introduction of InLong DBAgent in Section 1, we can see that each InLong DBAgent process runs 1-N collection Jobs. Each Job collects Binlog data from one DB instance, is bound to one Pulsar cluster configuration, and produces the collected data to that Pulsar cluster. Each Job contains multiple Tasks, and each Task corresponds to one Pulsar Topic, which receives the data of a set of databases and tables that meet the filtering conditions. The mapping to Pulsar objects is shown in Figure 7 below:

Figure 7 Pulsar SDK object corresponding to data flow within a single Job

In other words, when InLong DBAgent uses the Pulsar SDK, a single Java process must create and maintain 1-M Pulsar Client objects, and each Pulsar Client must in turn create and maintain 1-N Topic Producer objects.

2.1.2 Problems and Tuning

For the application scenarios described in the previous section, the following issues need to be considered and dealt with:

Question 1: Should the Pulsar Client object be maintained globally? If multiple Jobs have the same configuration, should they share one Pulsar Client object?

In the old version, this is indeed how we implemented it. Sharing not only reduces the number of Pulsar Client objects, but also reduces the number of connections between each collection node (each InLong DBAgent deployment node counts as a collection node) and the Pulsar cluster.

However, in the actual operation process, we encountered the following two problems.

First, the data volume is unbalanced across Jobs (and across Tasks within a Job). Some tables are very large, such as flow-data tables and indicator tables; some are very small, such as orders for certain overseas businesses; and some tables have cyclical patterns, such as batch updates in the early morning. If Jobs share one Pulsar Client and create Producers from it, the collection progress of Jobs that differ by orders of magnitude in volume will interfere with one another, eventually causing large collection delays.

Second, to keep data collection highly available, the system must be able to schedule Jobs across the InLong DBAgent nodes in the cluster according to machine load (that is, Job 1 may run on InLong DBAgent-1 at one moment and be scheduled onto InLong DBAgent-2 at a later moment). When multiple Jobs share a Pulsar Client, the Client and Producer objects have to be maintained dynamically as the sharing relationships change. This not only increases the difficulty of development and maintenance, but can also leak Client and Producer objects if done incorrectly, leaving hidden risks in the program. Moreover, closing a Producer or Client may affect other Jobs that are mid-flight, or even cause data loss.

After a period of deliberation, a strategy of complete isolation between Jobs was adopted during version iteration: each Job maintains its own Pulsar Client object and uses it to create the Producers for its 1-N Topics. This logically eliminates interference between Jobs. Some readers may ask: is there no interference between the multiple Tasks inside one Job? There is essentially none, or the impact is negligible, because each Job collects the Binlog of a single DB instance and pulls the data strictly in order; the data is naturally sequential, so different Topics basically do not affect each other. In addition, complete isolation between Jobs also makes HA scheduling of Jobs across InLong DBAgent nodes easier and reduces the cost of code development and maintenance.
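
Below is a hedged sketch of this per-Job isolation; the JobSender class, service URL handling, and topic list are illustrative rather than InLong DBAgent's real code. Each Job owns its own PulsarClient and builds one Producer per Task Topic, so closing a Job touches nothing shared with other Jobs.

import java.util.ArrayList;
import java.util.List;

import org.apache.pulsar.client.api.Producer;
import org.apache.pulsar.client.api.PulsarClient;

public class JobSender implements AutoCloseable {
    private final PulsarClient client;                           // owned by this Job only
    private final List<Producer<byte[]>> producers = new ArrayList<>();

    public JobSender(String serviceUrl, List<String> taskTopics) throws Exception {
        this.client = PulsarClient.builder()
                .serviceUrl(serviceUrl)                          // the Pulsar cluster configured for this Job
                .build();
        for (String topic : taskTopics) {                        // one Producer per Task Topic
            producers.add(client.newProducer().topic(topic).create());
        }
    }

    @Override
    public void close() throws Exception {
        // Closing this Job (e.g. when HA scheduling moves it to another DBAgent node)
        // cannot affect any other Job, because nothing here is shared.
        for (Producer<byte[]> p : producers) {
            p.close();
        }
        client.close();
    }
}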

There is another problem here that must be mentioned: the number of connections (which consumes FD resources).

On each InLong DBAgent, the maximum number of Jobs that can run concurrently is configured according to the machine's specification (the heterogeneous machine model mentioned earlier). The maximum number of connections between the node and the Pulsar cluster can then be estimated with the following formula (assuming the 1-N Topic partitions of each Job can spread across all Broker nodes):

Maximum number of connections = MaxJobsNum * PreBrokerConnectNum * PulsarBrokerNum * Min(MaxPartitionNum, PulsarBrokerNum)

For example, with MaxJobsNum = 60, PreBrokerConnectNum = 2, PulsarBrokerNum = 90, and MaxPartitionNum = 9:

Maximum number of connections = 60 * 2 * 90 * Min(9, 90) = 97200

This is a very large number relative to what typical production machines allow, and it grows as the number of Broker nodes increases or as the number of Jobs on a single InLong DBAgent node increases. During production deployment and operations, this value must be estimated and the deployment planned accordingly, to avoid large-scale process crashes that only appear later, after an initially problem-free period.

Question 2: When using Pulsar Producer to produce messages, in order to improve efficiency, can multi-threaded production be used?

The answer is yes: production can be spread across multiple threads. However, the following implementation (pseudocode) can seriously reduce production efficiency:

// Anti-pattern: multiple Sender threads share one Producer instance.
public class Sender extends Thread {
    private final Producer<byte[]> producer;
    private final BlockingQueue<byte[]> msgQueue;

    public Sender(Producer<byte[]> producer, BlockingQueue<byte[]> msgQueue) {
        this.producer = producer;
        this.msgQueue = msgQueue;
    }

    @Override
    public void run() {
        while (true) {
            byte[] msg = msgQueue.poll();      // take a message from the shared queue
            if (msg != null) {
                producer.sendAsync(msg);       // every thread contends on the same Producer
            }
        }
    }
}
// ...
// client: an existing PulsarClient; topic: the target topic
Producer<byte[]> producer = client.newProducer().topic(topic).create();
BlockingQueue<byte[]> msgQueue = new LinkedBlockingQueue<>();
new Sender(producer, msgQueue).start();
new Sender(producer, msgQueue).start();

As shown in the pseudocode, multiple threads poll data from msgQueue at the same time and produce Pulsar messages asynchronously (or synchronously, where the effect is even more pronounced) through the same Producer. During production, the Pulsar SDK round-robins across the partitions, which requires concurrency and lock control (interested readers can look at the Producer implementation in the Pulsar SDK). Sharing the Producer this way does not deliver the benefit of multi-threaded parallel sending; on the contrary, it lengthens production time and reduces efficiency.

If concurrent production across multiple threads is required, each thread should use its own Producer object. The improved version is shown below:

// Improvement: each Sender thread creates and owns its own Producer.
public class Sender extends Thread {
    private final Producer<byte[]> producer;
    private final BlockingQueue<byte[]> msgQueue;

    public Sender(PulsarClient client, String topic, BlockingQueue<byte[]> msgQueue) throws Exception {
        this.producer = client.newProducer().topic(topic).create();   // per-thread Producer
        this.msgQueue = msgQueue;
    }

    @Override
    public void run() {
        while (true) {
            byte[] msg = msgQueue.poll();
            if (msg != null) {
                producer.sendAsync(msg);       // no cross-thread contention on the Producer
            }
        }
    }
}

The above are two representative problems with producing Pulsar messages that I encountered on the collection side during development, testing, and operations. You can weigh them against the characteristics of your own business.

2.2 Pulsar consumption

2.2.1 Consumption scenario

As described in the background section, InLong Sort is built on the Flink framework and uses a single-task (Oceanus task), multi-Dataflow approach: under each Oceanus task, the data of 1-N Dataflows is sorted and written to storage. Each Dataflow corresponds to the consumption configuration of one Topic, and a single Dataflow can subscribe to data from multiple Pulsar clusters. The InLong Sort subscription process is therefore somewhat similar to InLong DBAgent's production scenario: one process must maintain multiple Pulsar Clients based on the 1-N Dataflow configurations and handle the corresponding 1-N Topic subscriptions.

2.2.2 Problems and Tuning

The message subscription and consumption part of InLong Sort has evolved into two versions. The following describes the processing methods and existing problems of the first version, as well as the improvements of the second version.

Before explaining the message subscription part, here is some background on how InLong Sort handles DB data. DB data is currently written mainly to Thive. State such as the MQ consumption position, data partition status, and the visibility of the written files is maintained through Flink's State mechanism and periodically saved to persistent storage by Flink's Checkpoint mechanism; the Checkpoint mechanism is also what controls when files become visible to users.

How the MQ consumption position is maintained and how file visibility within partitions is controlled directly determine data integrity. If the consumption position is updated and saved while the messages before it are not yet guaranteed to have landed in storage, a restart (expected or unexpected) will lose data. Conversely, if each restart begins consuming from a position whose messages have already been processed and whose files are already visible, the data will be consumed and stored again, producing duplicates. These two points are therefore the top priority in the sorting process.
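
The following is a minimal sketch, not InLong Sort's actual source, of how a consumption position can be tied to Flink's checkpoint cycle: the Pulsar MessageId is written into operator state in snapshotState, so restoring from a checkpoint resumes at a position whose preceding data was made durable by that same checkpoint.

import org.apache.flink.api.common.state.ListState;
import org.apache.flink.api.common.state.ListStateDescriptor;
import org.apache.flink.runtime.state.FunctionInitializationContext;
import org.apache.flink.runtime.state.FunctionSnapshotContext;
import org.apache.flink.streaming.api.checkpoint.CheckpointedFunction;
import org.apache.pulsar.client.api.MessageId;

public class PositionState implements CheckpointedFunction {
    private transient ListState<byte[]> positionState;   // serialized MessageId of the subscribed partition topic
    private volatile MessageId lastProcessed = MessageId.earliest;

    // Called by the Source after a message has been handed downstream.
    public void record(MessageId id) {
        lastProcessed = id;
    }

    @Override
    public void snapshotState(FunctionSnapshotContext ctx) throws Exception {
        // The position is persisted together with the checkpoint, i.e. only when the same
        // checkpoint also makes the written files durable and visible.
        positionState.clear();
        positionState.add(lastProcessed.toByteArray());
    }

    @Override
    public void initializeState(FunctionInitializationContext ctx) throws Exception {
        positionState = ctx.getOperatorStateStore()
                .getListState(new ListStateDescriptor<>("pulsar-position", byte[].class));
        for (byte[] bytes : positionState.get()) {
            lastProcessed = MessageId.fromByteArray(bytes);  // restore-from-checkpoint resumes here
        }
    }
}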

The following explains the first version's consumption process and its problems in detail.

The first version, similar to the approach of the Pulsar Flink Connector, was implemented with Pulsar Reader. By design, each Reader subscribes to one partition of a Topic, so the partition topic must be specified at initialization, and the Reader consumes through a randomly named, non-durable subscription.
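
A minimal sketch of this first-version approach follows; the topic name and the start position handling are illustrative. One Reader is created per partition topic and reads from an explicit MessageId, with no durable subscription recorded on the broker.

import org.apache.pulsar.client.api.Message;
import org.apache.pulsar.client.api.MessageId;
import org.apache.pulsar.client.api.PulsarClient;
import org.apache.pulsar.client.api.Reader;

PulsarClient client = PulsarClient.builder().serviceUrl("pulsar://broker:6650").build();

// One Reader per partition topic; the start position would come from restored Flink state,
// or MessageId.earliest / MessageId.latest when no state is available.
Reader<byte[]> reader = client.newReader()
        .topic("persistent://tenant/ns/db_topic-partition-0")
        .startMessageId(MessageId.latest)
        .create();

while (true) {
    Message<byte[]> msg = reader.readNext();   // no consumer-group ack; progress lives only in Flink state
    // deserialize, convert and emit downstream ...
}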

Randomly named subscriptions are very unfriendly to operational monitoring: after every restart, the subscription to be monitored has to be looked up and reconfigured. To make operations easier, the first version exploited a loophole in the Pulsar Broker version in use at the time (or rather a capability that ran contrary to the design, with no guarantee it would survive later versions): a persistent subscription name was assigned to each Reader, and the Broker's statistics for that persistent subscription were used to monitor progress.

In addition, during day-to-day operations of the sorting tasks, the memory, parallelism, and other configurations of the Flink job are often adjusted according to the message volume. Some of these adjustments affect State recovery; that is, after certain configuration changes, the task has to be started without restoring from the Checkpoint state.

There is also a recurring need, for both expected and unexpected reasons, to re-write a copy of the data. Backfilling from the source is rather heavyweight and requires configuration by the business side; the more convenient way is to re-consume the data from a historical position in Pulsar.

At this point, let’s summarize the capabilities we need in the sorting process:   

  1. Easy for operations to monitor consumption progress;   
  2. No data loss when not recovering from a Checkpoint;   
  3. The ability to reset the consumption position on demand.

From the above, the Reader approach falls short of these needs. First, there is the subscription-name problem already described: its continued availability in later versions cannot be guaranteed. Second, messages can be lost when not recovering from a Checkpoint: in that case the only options are to start consuming from the earliest or the latest position, and the former guarantees duplicates while the latter almost certainly loses data. Third, the position cannot be adjusted while the task is stopped; it can only be adjusted while it is running.

To address the risks and problems of the Reader approach, the second version of the InLong Sort consumption part switched to a Pulsar Consumer implementation.

First, the Consumer mode supports durable, named subscriptions, which makes it easy for operations to monitor consumption progress; this follows Pulsar's intended design and raises no compatibility concerns. Second, the Consumer mode supports resetting the position both while the task is running and after it has stopped, covering richer scenarios. Third, the Consumer mode supports multiple subscription types, namely Shared, Exclusive, Failover, and so on, and the sorting consumption scenario is well suited to the Exclusive type.

As with the Reader approach, when InLong Sort creates a Consumer in Exclusive mode, the partition topic also needs to be specified.
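
A corresponding sketch of the second-version approach (the subscription and topic names are illustrative): an Exclusive Consumer is created on each partition topic with a persistent, named subscription, whose progress the broker can report and whose position can be reset with seek().

import org.apache.pulsar.client.api.Consumer;
import org.apache.pulsar.client.api.MessageId;
import org.apache.pulsar.client.api.PulsarClient;
import org.apache.pulsar.client.api.SubscriptionType;

PulsarClient client = PulsarClient.builder().serviceUrl("pulsar://broker:6650").build();

Consumer<byte[]> consumer = client.newConsumer()
        .topic("persistent://tenant/ns/db_topic-partition-0")   // one Consumer per partition topic
        .subscriptionName("inlong-sort-dataflow-1")              // persistent, named subscription for monitoring
        .subscriptionType(SubscriptionType.Exclusive)            // only this Consumer reads the partition
        .subscribe();

// When data needs to be re-sorted, the position can be reset on demand,
// either here via seek() or externally while the task is stopped.
consumer.seek(MessageId.earliest);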

Why does InLong Sort not create Consumers in Shared mode? Above all, to guarantee data integrity.

Readers familiar with Pulsar's design and implementation will know that its consumption model differs greatly from previous-generation MQs such as RocketMQ and Kafka; the Pulsar community documentation covers this in detail. So what problems would Shared mode cause in InLong Sort?

InLong Sort is a Flink task, with operators and parallelism. If the Source (the operator whose consumers subscribe to the Pulsar Topic) created its consumers in Shared mode, every consumer created for the target Topic would receive messages from that Topic. How, then, should the consumption position be saved?

Starting from the position recorded on the Broker side after a restart is clearly problematic, because at restart time (normal or unexpected) there is no guarantee that the messages before that position have been successfully written to storage.

If instead the restart restores from a Checkpoint, using the position recorded in the State, how should that State be saved? Because every Consumer would be receiving data from every partition topic, each parallel instance would hold its own piece of acked-position information. So where should consumption resume after a restart? To avoid losing data, all of the State would have to be aggregated and, for each partition topic, the smallest position chosen as the reset point, which inevitably duplicates data. This not only increases program complexity and Checkpoint size, but also forces the use of Union State for saving; when such State grows large, it is very unfriendly to the task at restart and can even prevent the task from starting.
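
To make the cost concrete, here is a hypothetical sketch of what recovery would look like with Shared subscriptions (names and types are assumptions): every parallel instance would receive all instances' saved positions through union-state redistribution and would have to take the minimum per partition topic before seeking, re-reading data that has already been stored.

import java.util.HashMap;
import java.util.Map;

import org.apache.flink.api.common.state.ListState;
import org.apache.flink.api.common.state.ListStateDescriptor;
import org.apache.flink.api.common.typeinfo.TypeHint;
import org.apache.flink.api.common.typeinfo.TypeInformation;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.runtime.state.FunctionInitializationContext;
import org.apache.pulsar.client.api.MessageId;

// Inside initializeState(FunctionInitializationContext ctx):
ListState<Tuple2<String, byte[]>> allPositions = ctx.getOperatorStateStore()
        .getUnionListState(new ListStateDescriptor<>("all-positions",
                TypeInformation.of(new TypeHint<Tuple2<String, byte[]>>() {})));

// Aggregate the minimum acked position per partition topic across all parallel instances.
Map<String, MessageId> minPerPartition = new HashMap<>();
for (Tuple2<String, byte[]> entry : allPositions.get()) {
    MessageId id = MessageId.fromByteArray(entry.f1);
    minPerPartition.merge(entry.f0, id, (a, b) -> a.compareTo(b) <= 0 ? a : b);
}
// Each consumer would then seek() to the minimum for its partition topics, deliberately re-reading data.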

The above summarizes my analysis and experience with Pulsar on the data sorting side, for your reference.

3. Summary

This article shares an Apache InLong case of incremental DB data collection. It first introduced the overall architecture and some capabilities of InLong DBAgent, InLong Sort, the US reconciliation Runner, and other parts, then focused on experiences with Pulsar during collection and sorting, for your reference. The detailed design and implementation of each Apache InLong component can be found in the Apache InLong community and in documents and courses on related topics.


Origin my.oschina.net/u/4587289/blog/10143388