In-depth analysis of MirrorMaker 2, Kafka's asynchronous active-active replication solution

MirrorMaker 2 background

Normally we run our business on a single Kafka cluster. In some cases, however, a second Kafka cluster is needed for data synchronization and backup. For this scenario, early versions of Kafka shipped a tool called MirrorMaker (hereafter mm1) to synchronize data between two Kafka clusters.

The initial MirrorMaker is essentially just a consumer plus a producer. But it has a number of shortcomings, including:

  • The target cluster's partitioning is decided by the producer's DefaultPartitioner, so messages can land in different partitions than they occupied in the source cluster
  • Topic configuration and ACLs are not synchronized to the target cluster (more on this below)
  • Because the source and target topics share the same name, two clusters that mirror each other can replicate messages in an infinite loop (more on this below)

Because of these problems, mm1 is hard to use in a production environment. Kafka 2.4 therefore introduced MirrorMaker 2 (hereafter mm2). MirrorMaker 2 is built on the Kafka Connect framework and solves most of the problems mentioned above.

In this post I cover the design, main features, and deployment of MirrorMaker 2.

Design and function

Overall design

MirrorMaker 2 is developed on top of the Kafka Connect framework. You can roughly think of mm2 as a combination of several source and sink connectors, including:

  • MirrorSourceConnector / MirrorSourceTask: the connector that replicates topic data
  • MirrorCheckpointConnector / MirrorCheckpointTask: the connector that synchronizes auxiliary information, mainly consumer offsets
  • MirrorHeartbeatConnector / MirrorHeartbeatTask: the connector that maintains heartbeats

Although mm2 is based on the Kafka Connect framework, it modifies the framework to a certain extent. An mm2 cluster can be deployed on its own, or it can run on a standalone Kafka Connect instance or a Kafka Connect cluster. This is covered in the deployment section below.

As with mm1, in the simplest master-slave backup scenario mm2 should be deployed alongside the target cluster, i.e. consume from the remote cluster and write locally. If it is deployed on the source-cluster side instead, a failure on the remote produce path can lose data.

The overall structure is shown in the figure:

[Figure: overall mm2 architecture]

Internal topic design

mm2 creates several internal topics in Kafka to store state and configuration related to the source cluster's topics, and to maintain heartbeats. There are three main internal topics:

  • heartbeat topic
  • checkpoints topic
  • offset sync topic

These internal topics are fairly self-explanatory; you can mostly tell what they are for from their names. It is worth mentioning that the checkpoint and heartbeat features can be turned off through configuration. Below we describe the purpose and data format of each topic in detail.

heartbeat topic

With the default configuration, both the source cluster and the target cluster have a topic for heartbeats. Through this topic, a consumer client can confirm both that the current connector is alive and that the source cluster is reachable.

The schema of the heartbeat topic is as follows:

  • target cluster: the cluster that receives the heartbeat
  • source cluster: the cluster that sends the heartbeat
  • timestamp: when the heartbeat was emitted
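As a rough illustration of how a client might use this topic (this is not mm2's internal code; the method name and staleness threshold are assumptions for the sketch), availability can be judged by how stale the most recent heartbeat timestamp is:

```java
import java.time.Duration;
import java.time.Instant;

public class HeartbeatCheck {
    // Illustrative only: decide whether the replication path looks alive,
    // given the timestamp of the latest heartbeat record consumed from the
    // heartbeats topic and an acceptable staleness threshold.
    static boolean sourceLooksAlive(Instant lastHeartbeat, Instant now, Duration maxLag) {
        return !lastHeartbeat.plus(maxLag).isBefore(now);
    }

    public static void main(String[] args) {
        Instant now = Instant.parse("2021-01-12T10:00:00Z");
        // Heartbeat 5 seconds old, threshold 30 seconds -> alive
        System.out.println(sourceLooksAlive(now.minusSeconds(5), now, Duration.ofSeconds(30)));
        // Heartbeat 2 minutes old -> considered down
        System.out.println(sourceLooksAlive(now.minusSeconds(120), now, Duration.ofSeconds(30)));
    }
}
```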

checkpoints topic

The corresponding connector (MirrorCheckpointConnector) periodically sends checkpoint information to the target cluster, mainly the offsets committed by each consumer group plus related auxiliary information.

The schema of the checkpoints topic is as follows:

  • consumer group id (String)
  • topic (String): the replicated topic name (in the target cluster this includes the source-cluster alias prefix)
  • partition (int)
  • upstream offset (int): The offset submitted by the specified consumer group of the source cluster (latest committed offset in source cluster)
  • downstream offset (int): the synchronized offset of the target cluster (latest committed offset translated to target cluster)
  • metadata (String): partition metadata
  • timestamp

Another mm2 feature, switching a consumer between clusters, is implemented through this topic. Because it stores the consumption offsets of the source cluster's consumer groups, in some scenarios (such as a source-cluster failure) a consumer can be switched to the target cluster: it obtains its offsets from this topic and continues consuming there.
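As a rough sketch of the idea (not mm2's internal code; the record shape and field names are simplified assumptions), failover tooling keeps the latest checkpoint seen per partition for a group and uses its downstream offset as the position to resume from in the target cluster:

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class CheckpointLookup {
    // Simplified stand-in for a record read from the checkpoints topic.
    static class Checkpoint {
        final String group, topic;
        final int partition;
        final long upstreamOffset, downstreamOffset;
        Checkpoint(String group, String topic, int partition, long up, long down) {
            this.group = group; this.topic = topic; this.partition = partition;
            this.upstreamOffset = up; this.downstreamOffset = down;
        }
    }

    // For one consumer group, keep the last checkpoint seen per topic-partition
    // (later records overwrite earlier ones) and return the downstream offsets
    // to resume from in the target cluster.
    static Map<String, Long> resumeOffsets(String group, List<Checkpoint> checkpoints) {
        Map<String, Long> offsets = new HashMap<>();
        for (Checkpoint c : checkpoints) {
            if (c.group.equals(group)) {
                offsets.put(c.topic + "-" + c.partition, c.downstreamOffset);
            }
        }
        return offsets;
    }

    public static void main(String[] args) {
        List<Checkpoint> log = List.of(
            new Checkpoint("my-group", "us-west.orders", 0, 100L, 90L),
            new Checkpoint("my-group", "us-west.orders", 0, 150L, 140L), // later checkpoint wins
            new Checkpoint("other-group", "us-west.orders", 0, 50L, 40L));
        System.out.println(resumeOffsets("my-group", log).get("us-west.orders-0")); // 140
    }
}
```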

offset sync

This topic synchronizes the offsets of topic partitions between the two clusters. The offsets here are not consumer offsets, but log offsets.

Its schema is as follows:

  • topic (String): topic name
  • partition (int)
  • upstream offset (int): the offset in the source cluster
  • downstream offset (int): the corresponding offset in the target cluster, which should stay consistent with the source cluster's

config sync

mm2 synchronizes data from the source cluster to the target cluster, so what read and write permissions should the corresponding topic in the target cluster have? mm2's convention is that the corresponding topic in the target cluster (the replica of the source topic) may only be written to by mm2's source and sink connectors. To enforce this, mm2 propagates ACL policies to downstream topics using the following rules:

  • If a principal has read access to a topic in the source cluster, it is granted read access to the corresponding topic in the target cluster
  • No principal other than mm2 is granted write access to the corresponding topic in the target cluster

Topic-related configuration is synchronized at the same time.


Consumer switch cluster

As mentioned above, the source cluster's consumer group offsets are stored in the checkpoints topic of the target cluster. To obtain these offsets you can use the MirrorClient#remoteConsumerOffsets API, and then use Consumer#seek to resume consumption from the returned offsets.

Here is some rough example code. First, add the Maven dependencies:

<dependency>
  <groupId>org.apache.kafka</groupId>
  <artifactId>connect-mirror</artifactId>
  <version>2.4.0</version>
</dependency>
<dependency>
  <groupId>org.apache.kafka</groupId>
  <artifactId>connect-mirror-client</artifactId>
  <version>2.4.0</version>
</dependency>

Then get the offset information:

MirrorMakerConfig mmConfig = new MirrorMakerConfig(mm2.getProp());
MirrorClientConfig mmClientConfig = mmConfig.clientConfig("target-cluster");
MirrorClient mmClient = new MirrorClient(mmClientConfig);

Map<TopicPartition, OffsetAndMetadata> offsetMap =
        mmClient.remoteConsumerOffsets("my-consumer-group", "source-cluster", Duration.ofMinutes(1));

The usage of Consumer#seek itself is not demonstrated here.
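Even so, a minimal sketch of the seek step may help. TopicPartition and OffsetAndMetadata below are simplified local stand-ins (not the real kafka-clients types) so the sketch is self-contained; with the real client you would call consumer.assign(offsetMap.keySet()) followed by consumer.seek(tp, meta.offset()) for each entry:

```java
import java.util.HashMap;
import java.util.Map;

public class SeekSketch {
    // Local stand-ins for kafka-clients' TopicPartition and OffsetAndMetadata.
    static class TopicPartition {
        final String topic; final int partition;
        TopicPartition(String topic, int partition) { this.topic = topic; this.partition = partition; }
    }
    static class OffsetAndMetadata {
        final long offset;
        OffsetAndMetadata(long offset) { this.offset = offset; }
        long offset() { return offset; }
    }

    // With the real client, the loop body would be consumer.seek(tp, meta.offset());
    // here we just collect the planned seek positions per topic-partition.
    static Map<String, Long> planSeeks(Map<TopicPartition, OffsetAndMetadata> offsetMap) {
        Map<String, Long> plan = new HashMap<>();
        for (Map.Entry<TopicPartition, OffsetAndMetadata> e : offsetMap.entrySet()) {
            plan.put(e.getKey().topic + "-" + e.getKey().partition, e.getValue().offset());
        }
        return plan;
    }

    public static void main(String[] args) {
        Map<TopicPartition, OffsetAndMetadata> offsetMap = new HashMap<>();
        offsetMap.put(new TopicPartition("us-west.orders", 0), new OffsetAndMetadata(140L));
        System.out.println(planSeeks(offsetMap)); // one seek: us-west.orders-0 -> 140
    }
}
```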

Other features

Finally, a brief look at some other, more basic features.

Source and target cluster partitions are kept in sync

  • Message partitioning and ordering are preserved between the source and target clusters
  • The target cluster keeps the same number of partitions as the source cluster
  • Each source-cluster topic corresponds to exactly one target-cluster topic
  • Each source-cluster partition corresponds to exactly one target-cluster partition
  • Partition i in the target cluster corresponds to partition i in the source cluster

In short, the partitions and messages of the source and target clusters are kept as consistent as possible. Duplicate messages are still possible, though, because exactly-once semantics are not yet supported; reportedly this is planned for a later release (after 2.4).

A prefix is added to replicated topics

mm1 has a flaw here: because the source and target topic names are identical, two clusters that back each other up can replicate messages in an infinite loop (a message from topic a is copied to b, then from b back to a, and so on). mm2 avoids this by adding the source-cluster alias as a prefix to the replicated topic's name; when two clusters back each other up, messages in topics that already carry a prefix are not replicated back.
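The default naming rule can be illustrated with a simplified re-implementation (this mimics the behavior of mm2's DefaultReplicationPolicy but is not its actual code): the remote topic name is the source alias plus a separator, and a topic that already starts with a known alias is recognized as a replica and skipped, which breaks the cycle:

```java
public class RemoteTopicNaming {
    static final String SEPARATOR = "."; // mm2's default separator

    // Name a topic as it appears in the target cluster.
    static String formatRemoteTopic(String sourceAlias, String topic) {
        return sourceAlias + SEPARATOR + topic;
    }

    // A topic that already starts with a known cluster alias is a replica,
    // so replicating it back to its origin would create a loop.
    static boolean isRemoteTopic(String topic, String[] knownAliases) {
        for (String alias : knownAliases) {
            if (topic.startsWith(alias + SEPARATOR)) return true;
        }
        return false;
    }

    public static void main(String[] args) {
        String[] aliases = {"us-west", "us-east"};
        System.out.println(formatRemoteTopic("us-west", "orders"));   // us-west.orders
        System.out.println(isRemoteTopic("us-west.orders", aliases)); // true -> skip
        System.out.println(isRemoteTopic("orders", aliases));         // false -> replicate
    }
}
```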

Synchronizing configuration and ACLs

mm1 does not synchronize topic configuration or ACL information, which makes cluster management harder. mm2 solves this problem: the source cluster's topic configuration and ACLs are automatically synchronized to the target cluster.

Having covered the features, let's finish with how to deploy mm2.

Deployment method

Three deployment methods are currently supported:

  • Dedicated mm2 cluster: no dependency on Kafka Connect; mm2 provides a driver so an mm2 cluster can be deployed on its own and started with a single command: ./bin/connect-mirror-maker.sh mm2.properties
  • On a Kafka Connect cluster: you first start the Connect cluster, then manually start each mm2 connector, which is somewhat cumbersome. Suitable when a Kafka Connect cluster already exists.
  • On standalone Kafka Connect: you configure each connector in the configuration file, then start the standalone Connect service. This is less convenient than the dedicated mm2 cluster mode and less robust than the Connect cluster mode, so it is best suited to test environments.

Refer to KIP-382 for the mm2 configuration options. The main settings include the broker addresses of the source and target clusters, whether the heartbeat and checkpoint features are enabled, and the synchronization intervals.

Dedicated mm2 cluster deployment

Deploying an mm2 cluster is relatively simple; just write a configuration file such as config/mm2.properties:

# Define the two clusters and their hosts
clusters = us-west, us-east
us-west.bootstrap.servers = host1:9092
us-east.bootstrap.servers = host2:9092

# Topics & consumer groups to replicate; regular expressions are supported
topics = .*
groups = .*
emit.checkpoints.interval.seconds = 10

# Define the replication flow; it can be bidirectional
us-west->us-east.enabled = true
# us-east->us-west.enabled = true  # bidirectional: matching topics in the two clusters back each other up

# Some settings can be customized
us-west.offset.storage.topic = mm2-offsets
# Heartbeats can be enabled or disabled per flow; by default they are emitted in both directions
us-west->us-east.emit.heartbeats.enabled = false

Then start it with a single command: ./bin/connect-mirror-maker.sh mm2.properties. After startup, check the process with jps and list the topics; if a number of new topics have appeared, the startup most likely succeeded.

Incidentally, if you are using a Kafka Connect cluster, you need to start each connector manually with a request like this:

PUT /connectors/us-west-source/config HTTP/1.1
 
{
    "name": "us-west-source",
    "connector.class": "org.apache.kafka.connect.mirror.MirrorSourceConnector",
    "source.cluster.alias": "us-west",
    "target.cluster.alias": "us-east",
    "source.cluster.bootstrap.servers": "us-west-host1:9091",
    "topics": ".*"
}

That's all for this post.

Original link: http://www.cnblogs.com/listenfwind/p/14269259.html

