Is 'exactly once' only for streams (topic1 -> app -> topic2)?

b15 :

I have an architecture where we have two separate applications. The original source is a SQL database. App1 listens to CDC tables to track changes to tables in that database, then normalizes and serializes those changes. It takes those serialized messages and sends them to a Kafka topic. App2 listens to that topic, adapts the messages to different formats, and sends those adapted messages to their respective destinations via HTTP.

So our streaming architecture looks like:

SQL (CDC event) -> App1 ( normalizes events) -> Kafka -> App2 (adapts events to endpoints) -> various endpoints

We're looking to add error handling in case of failure, and we cannot tolerate duplicate events, missing events, or reordered events. Given the architecture above, all we really care about is that exactly-once applies to messages getting from App1 to App2 (our separate producer and consumer applications).

Everything I'm reading and every example I've found of the transactional API points to "streaming". It looks like the Kafka Streams API is meant for an individual application that takes input from a Kafka topic, does its processing, and outputs to another Kafka topic, which doesn't seem to apply to our use of Kafka. Here's an excerpt from Confluent's docs:

Now, stream processing is nothing but a read-process-write operation on a Kafka topic; a consumer reads messages from a Kafka topic, some processing logic transforms those messages or modifies state maintained by the processor, and a producer writes the resulting messages to another Kafka topic. Exactly once stream processing is simply the ability to execute a read-process-write operation exactly one time. In this case, “getting the right answer” means not missing any input messages or producing any duplicate output. This is the behavior users expect from an exactly once stream processor.

I'm struggling to wrap my head around how we can use exactly-once with our Kafka topic, or if Kafka's exactly-once is even built for non-"streaming" use cases. Will we have to build our own deduplication and fault tolerance?

Michael G. Noll :

If you are using Kafka's Streams API (or another tool that supports exactly-once processing with Kafka), then Kafka's exactly-once semantics (EOS) extend across applications:

topic A --> App 1 --> topic B --> App 2 --> topic C

In your use case, one question is whether the initial CDC step supports EOS, too. In other words, you must ask: which steps are involved, and are all of those steps covered by EOS?

In the following example, EOS is supported end-to-end if (and only if) the initial CDC step supports EOS as well, like the rest of the data flow.

SQL --CDC--> topic A --> App 1 --> topic B --> App 2 --> topic C

If you use Kafka Connect for the CDC step, then you must check whether the connector you use supports EOS or not.

Everything I'm reading and every example I've found of the transactional API points to "streaming".

The transactional API of the Kafka producer/consumer clients provides the primitives for EOS processing. Kafka Streams, which sits on top of the producer/consumer clients, uses this functionality to implement EOS in a way that developers can use easily, with just a few lines of code (for example, it automatically takes care of state management when an application needs a stateful operation such as an aggregation or a join). Perhaps that relationship between the producer/consumer clients and Kafka Streams was the source of your confusion when reading the documentation?
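
To make that relationship concrete, here is a minimal sketch of the Kafka Streams side, where EOS comes down to a single configuration setting. The application ID, topic names, and the trivial "normalize" step are made up for illustration:

```java
import java.util.Properties;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;

public class EosStreamsSketch {

    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "app1-normalizer"); // hypothetical id
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        // The single setting that enables EOS; use EXACTLY_ONCE on Kafka versions
        // that predate EXACTLY_ONCE_V2 (introduced in Kafka 2.8).
        props.put(StreamsConfig.PROCESSING_GUARANTEE_CONFIG, StreamsConfig.EXACTLY_ONCE_V2);

        StreamsBuilder builder = new StreamsBuilder();
        // Hypothetical read-process-write topology: topic-a -> normalize -> topic-b.
        builder.<String, String>stream("topic-a")
               .mapValues(value -> value.trim()) // stand-in for real normalization logic
               .to("topic-b");

        KafkaStreams streams = new KafkaStreams(builder.build(), props);
        streams.start();
    }
}
```

With that one setting, Kafka Streams takes care of the transactional produces, the offset commits, and the recovery after failures internally.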

Of course, you can also "build your own" by using the underlying Kafka producer and consumer clients (with the transactional APIs) when developing your applications, but that's more work.
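
For comparison, here is a rough sketch of that do-it-yourself path with the plain clients; the topic, group, and transactional IDs are assumptions. The essential moves are giving the producer a transactional.id, disabling auto-commit on the consumer, and committing the consumed offsets inside the producer's transaction:

```java
import java.time.Duration;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.clients.consumer.OffsetAndMetadata;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.TopicPartition;

public class ManualEosLoop {

    public static void main(String[] args) {
        Properties consumerProps = new Properties();
        consumerProps.put("bootstrap.servers", "localhost:9092");
        consumerProps.put("group.id", "app1-group");            // hypothetical group id
        consumerProps.put("enable.auto.commit", "false");       // offsets go through the transaction
        consumerProps.put("isolation.level", "read_committed"); // never read aborted data
        consumerProps.put("key.deserializer",
                "org.apache.kafka.common.serialization.StringDeserializer");
        consumerProps.put("value.deserializer",
                "org.apache.kafka.common.serialization.StringDeserializer");

        Properties producerProps = new Properties();
        producerProps.put("bootstrap.servers", "localhost:9092");
        producerProps.put("transactional.id", "app1-tx-0");     // stable, unique per producer instance
        producerProps.put("key.serializer",
                "org.apache.kafka.common.serialization.StringSerializer");
        producerProps.put("value.serializer",
                "org.apache.kafka.common.serialization.StringSerializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(consumerProps);
             KafkaProducer<String, String> producer = new KafkaProducer<>(producerProps)) {

            consumer.subscribe(List.of("topic-a"));
            producer.initTransactions(); // fences zombie instances with the same transactional.id

            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                if (records.isEmpty()) {
                    continue;
                }
                producer.beginTransaction();
                try {
                    Map<TopicPartition, OffsetAndMetadata> offsets = new HashMap<>();
                    for (ConsumerRecord<String, String> record : records) {
                        // the "process" step of read-process-write would go here
                        producer.send(new ProducerRecord<>("topic-b", record.key(), record.value()));
                        offsets.put(new TopicPartition(record.topic(), record.partition()),
                                new OffsetAndMetadata(record.offset() + 1));
                    }
                    // The consumed offsets are committed inside the same transaction,
                    // so the read and the write succeed or fail together.
                    producer.sendOffsetsToTransaction(offsets, consumer.groupMetadata());
                    producer.commitTransaction();
                } catch (Exception e) {
                    producer.abortTransaction();
                    // A real implementation would also rewind the consumer to the
                    // last committed offsets before retrying the batch.
                }
            }
        }
    }
}
```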

I'm struggling to wrap my head around how we can use exactly-once with our Kafka topic, or if Kafka's exactly-once is even built for non-"streaming" use cases. Will we have to build our own deduplication and fault tolerance?

Not sure what you mean by "non-streaming" use cases. If you mean, "If we don't want to use Kafka Streams or KSQL (or another existing tool that can read from Kafka to process data), what would we need to do to achieve EOS in our applications?", then the answer is: yes, in that case you must use the Kafka producer/consumer clients directly and ensure that whatever you are doing with them properly implements EOS processing. (And because the latter is difficult to get right, this EOS functionality was added to Kafka Streams.)
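
One caveat for your specific pipeline: Kafka's transactions end at the Kafka topic, so the final App2-to-HTTP hop sits outside EOS no matter what you do inside Kafka. The closest you can get on that side is a read_committed consumer that commits offsets only after delivery succeeds, which makes that last hop at-least-once unless your endpoints are idempotent. A sketch, with hypothetical names:

```java
import java.time.Duration;
import java.util.List;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class App2ConsumerSketch {

    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("group.id", "app2-adapters");            // hypothetical group id
        props.put("isolation.level", "read_committed");    // skip data from aborted transactions
        props.put("enable.auto.commit", "false");          // commit only after delivery succeeds
        props.put("key.deserializer",
                "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer",
                "org.apache.kafka.common.serialization.StringDeserializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(List.of("topic-b"));
            while (true) {
                for (ConsumerRecord<String, String> record : consumer.poll(Duration.ofMillis(500))) {
                    adaptAndSend(record.value());
                }
                // Committing after the HTTP calls means a crash in between re-delivers
                // the batch: the Kafka-to-HTTP hop is at-least-once unless the
                // endpoints are idempotent.
                consumer.commitSync();
            }
        }
    }

    private static void adaptAndSend(String message) {
        // placeholder: adapt the message and POST it to its destination endpoint
    }
}
```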

I hope that helps.
