1. Background
The previous article introduced the merge operator. This article shows how to filter recurring events out of a Kafka Streams stream so that only unique events remain.
2. Demo Description
Suppose we want to de-duplicate events in the following format:
{"ip":"10.0.0.1","url":"https://docs.confluent.io/current/tutorials/examples/kubernetes/gke-base/docs/index.html","timestamp":"2019-09-16T14:53:43+00:00"}
As before, each event is serialized with Protocol Buffers. It consists of three fields: ip, url, and timestamp.
3. Project Configuration
First, create the project directory:
$ mkdir distinct-events && cd distinct-events
Then create the Gradle build file build.gradle in the distinct-events directory, as follows:
buildscript {
    repositories {
        jcenter()
    }
    dependencies {
        classpath 'com.github.jengelman.gradle.plugins:shadow:4.0.2'
    }
}

plugins {
    id 'java'
    id "com.google.protobuf" version "0.8.10"
}
apply plugin: 'com.github.johnrengelman.shadow'

repositories {
    mavenCentral()
    jcenter()
    maven { url 'http://packages.confluent.io/maven' }
}

group 'huxihx.kafkastreams'

sourceCompatibility = 1.8
targetCompatibility = '1.8'
version = '0.0.1'

dependencies {
    implementation 'com.google.protobuf:protobuf-java:3.9.1'
    implementation 'org.slf4j:slf4j-simple:1.7.26'
    implementation 'org.apache.kafka:kafka-streams:2.3.0'
    testCompile group: 'junit', name: 'junit', version: '4.12'
}

protobuf {
    generatedFilesBaseDir = "$projectDir/src/"
    protoc {
        artifact = 'com.google.protobuf:protoc:3.0.0'
    }
}

jar {
    manifest {
        attributes(
            'Class-Path': configurations.compile.collect { it.getName() }.join(' '),
            'Main-Class': 'huxihx.kafkastreams.FindDistinctEvents'
        )
    }
}

shadowJar {
    archiveName = "kstreams-transform-standalone-${version}.${extension}"
}
Note that the main class is set to huxihx.kafkastreams.FindDistinctEvents.
Save the file, then run the following command to generate the Gradle wrapper:
$ gradle wrapper
Once this is done, we create a subdirectory named configuration in the distinct-events directory that holds our parameter configuration file dev.properties:
$ mkdir configuration
application.id=find-distinct-app
bootstrap.servers=localhost:9092

input.topic.name=clicks
input.topic.partitions=1
input.topic.replication.factor=1

output.topic.name=distinct-clicks
output.topic.partitions=1
output.topic.replication.factor=1
Here we configure one input topic and one output topic. They hold the incoming message stream and the de-duplicated message stream, respectively.
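The sketch below shows how these keys can be read with java.util.Properties, which is what the application does when it loads this file. To stay runnable without the file, the content is inlined as a string; the class name DevPropertiesSketch and the load helper are ours and are not part of the project.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.Properties;

public class DevPropertiesSketch {
    // Same keys as dev.properties, inlined so the sketch runs without the file.
    static final String DEV_PROPERTIES =
        "application.id=find-distinct-app\n"
        + "bootstrap.servers=localhost:9092\n"
        + "input.topic.name=clicks\n"
        + "input.topic.partitions=1\n"
        + "input.topic.replication.factor=1\n"
        + "output.topic.name=distinct-clicks\n"
        + "output.topic.partitions=1\n"
        + "output.topic.replication.factor=1\n";

    // Parses key=value lines into a Properties object.
    static Properties load(String text) throws IOException {
        Properties props = new Properties();
        props.load(new StringReader(text));
        return props;
    }

    public static void main(String[] args) throws IOException {
        Properties props = load(DEV_PROPERTIES);
        // All values are strings, so numeric settings must be parsed explicitly.
        int partitions = Integer.parseInt(props.getProperty("input.topic.partitions"));
        short replication = Short.parseShort(props.getProperty("input.topic.replication.factor"));
        System.out.println(props.getProperty("input.topic.name")
            + " partitions=" + partitions + " replication=" + replication);
    }
}
```

Every value comes back as a string, which is why the partition count and replication factor are parsed before they are used to create topics.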
4. Create Message Schema
Next, create the schema used by the topics. Run the following command in the distinct-events directory to create a folder for the schema:
$ mkdir -p src/main/proto
Then create a file named click.proto in the proto folder with the following content:
syntax = "proto3";

package huxihx.kafkastreams.proto;

message Click {
    string ip = 1;
    string url = 2;
    string timestamp = 3;
}
After saving the file, run the gradlew command in the distinct-events directory:
$ ./gradlew build
At this point, you should see the generated Java class ClickOuterClass under distinct-events/src/main/java/huxihx/kafkastreams/proto.
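To get a feel for what a serialized Click looks like on the wire, here is a hand-rolled sketch of the Protocol Buffers encoding for three string fields. It is only an illustration: in the project the generated class does this work via toByteArray(), and the class and method names below are ours. Each non-empty string field is written as a tag byte ((field number << 3) | 2), a varint length, and the UTF-8 bytes.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;

public class ClickWireFormatSketch {
    // Writes an unsigned varint: 7 bits per byte, high bit set on every byte except the last.
    static void writeVarint(ByteArrayOutputStream out, int value) {
        while ((value & ~0x7F) != 0) {
            out.write((value & 0x7F) | 0x80);
            value >>>= 7;
        }
        out.write(value);
    }

    // Writes one string field: tag, byte length, then the UTF-8 bytes.
    static void writeString(ByteArrayOutputStream out, int fieldNumber, String s) {
        if (s.isEmpty()) {
            return; // proto3 does not write fields that hold the default value
        }
        byte[] utf8 = s.getBytes(StandardCharsets.UTF_8);
        writeVarint(out, (fieldNumber << 3) | 2); // wire type 2 = length-delimited
        writeVarint(out, utf8.length);
        out.write(utf8, 0, utf8.length);
    }

    // Encodes the three Click fields in field-number order (ip = 1, url = 2, timestamp = 3).
    static byte[] encodeClick(String ip, String url, String timestamp) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        writeString(out, 1, ip);
        writeString(out, 2, url);
        writeString(out, 3, timestamp);
        return out.toByteArray();
    }

    public static void main(String[] args) {
        StringBuilder hex = new StringBuilder();
        for (byte b : encodeClick("a", "b", "c")) {
            hex.append(String.format("%02X ", b & 0xFF));
        }
        System.out.println(hex.toString().trim()); // prints 0A 01 61 12 01 62 1A 01 63
    }
}
```

The field names never appear in the bytes, only the field numbers from click.proto, which is why those numbers should not be changed once messages have been written.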
5. Create Serdes
In this step we create the Serdes needed for the topic messages. First, run the following command in the distinct-events directory to create the corresponding folder:
$ mkdir -p src/main/java/huxihx/kafkastreams/serdes
Then create ProtobufSerializer.java in the newly created serdes folder:
package huxihx.kafkastreams.serdes;

import com.google.protobuf.MessageLite;
import org.apache.kafka.common.serialization.Serializer;

public class ProtobufSerializer<T extends MessageLite> implements Serializer<T> {
    @Override
    public byte[] serialize(String topic, T data) {
        return data == null ? new byte[0] : data.toByteArray();
    }
}
Then ProtobufDeserializer.java:
package huxihx.kafkastreams.serdes;

import com.google.protobuf.InvalidProtocolBufferException;
import com.google.protobuf.MessageLite;
import com.google.protobuf.Parser;
import org.apache.kafka.common.errors.SerializationException;
import org.apache.kafka.common.serialization.Deserializer;

import java.util.Map;

public class ProtobufDeserializer<T extends MessageLite> implements Deserializer<T> {

    private Parser<T> parser;

    // The parser is passed in through the config map under the key "parser".
    @Override
    @SuppressWarnings("unchecked")
    public void configure(Map<String, ?> configs, boolean isKey) {
        parser = (Parser<T>) configs.get("parser");
    }

    @Override
    public T deserialize(String topic, byte[] data) {
        try {
            return parser.parseFrom(data);
        } catch (InvalidProtocolBufferException e) {
            throw new SerializationException("Failed to deserialize from a protobuf byte array.", e);
        }
    }
}
Finally ProtobufSerdes.java:
package huxihx.kafkastreams.serdes;

import com.google.protobuf.MessageLite;
import com.google.protobuf.Parser;
import org.apache.kafka.common.serialization.Deserializer;
import org.apache.kafka.common.serialization.Serde;
import org.apache.kafka.common.serialization.Serializer;

import java.util.HashMap;
import java.util.Map;

public class ProtobufSerdes<T extends MessageLite> implements Serde<T> {

    private final Serializer<T> serializer;
    private final Deserializer<T> deserializer;

    public ProtobufSerdes(Parser<T> parser) {
        serializer = new ProtobufSerializer<>();
        deserializer = new ProtobufDeserializer<>();
        // Hand the parser to the deserializer through its configure() method.
        Map<String, Parser<T>> config = new HashMap<>();
        config.put("parser", parser);
        deserializer.configure(config, false);
    }

    @Override
    public Serializer<T> serializer() {
        return serializer;
    }

    @Override
    public Deserializer<T> deserializer() {
        return deserializer;
    }
}
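A Serde is simply a serializer and a deserializer that must agree with each other. The sketch below illustrates that round-trip contract in plain Java, with a hypothetical SimpleSerde class and a string codec standing in for the Kafka interfaces and the protobuf classes. It also mirrors one detail of ProtobufSerializer above: null is written as an empty byte array, so it does not come back as null.

```java
import java.nio.charset.StandardCharsets;
import java.util.function.Function;

public class SerdeRoundTripSketch {
    // Hypothetical stand-in for a Serde: one function per direction.
    static final class SimpleSerde<T> {
        private final Function<T, byte[]> serializer;
        private final Function<byte[], T> deserializer;

        SimpleSerde(Function<T, byte[]> serializer, Function<byte[], T> deserializer) {
            this.serializer = serializer;
            this.deserializer = deserializer;
        }

        // Serializes and immediately deserializes a value.
        T roundTrip(T value) {
            return deserializer.apply(serializer.apply(value));
        }
    }

    // String codec that, like ProtobufSerializer, maps null to an empty byte array.
    static SimpleSerde<String> stringSerde() {
        return new SimpleSerde<String>(
            s -> s == null ? new byte[0] : s.getBytes(StandardCharsets.UTF_8),
            bytes -> new String(bytes, StandardCharsets.UTF_8));
    }

    public static void main(String[] args) {
        SimpleSerde<String> serde = stringSerde();
        System.out.println(serde.roundTrip("10.0.0.1"));      // prints 10.0.0.1
        System.out.println(serde.roundTrip(null).isEmpty());  // prints true
    }
}
```

With the protobuf classes the analogous effect is that an empty byte array parses into a Click whose fields are all empty strings, so a null value and an empty message are indistinguishable after the round trip.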
6. Develop the main flow
First, create DeduplicationTransformer.java in src/main/java/huxihx/kafkastreams. This Java class implements the de-duplication logic:
package huxihx.kafkastreams;

import org.apache.kafka.streams.KeyValue;
import org.apache.kafka.streams.kstream.KeyValueMapper;
import org.apache.kafka.streams.kstream.Transformer;
import org.apache.kafka.streams.processor.ProcessorContext;
import org.apache.kafka.streams.state.WindowStore;
import org.apache.kafka.streams.state.WindowStoreIterator;

/**
 * Performs the de-duplication logic (in this demo, by IP address).
 *
 * @param <K> record key type
 * @param <V> record value type
 * @param <E> type of the event ID used to detect duplicates
 */
public class DeduplicationTransformer<K, V, E> implements Transformer<K, V, KeyValue<K, V>> {

    private static final String storeName = "eventId-store";

    private ProcessorContext context;
    private WindowStore<E, Long> eventIdStore;

    private final long leftDurationMs;
    private final long rightDurationMs;

    private final KeyValueMapper<K, V, E> idExtractor;

    DeduplicationTransformer(final long maintainDurationPerEventInMs, final KeyValueMapper<K, V, E> idExtractor) {
        if (maintainDurationPerEventInMs < 1) {
            throw new IllegalArgumentException("maintain duration per event must be >= 1");
        }
        // Split the duration into a left and a right half around the event time.
        leftDurationMs = maintainDurationPerEventInMs / 2;
        rightDurationMs = maintainDurationPerEventInMs - leftDurationMs;
        this.idExtractor = idExtractor;
    }

    @Override
    @SuppressWarnings("unchecked")
    public void init(ProcessorContext context) {
        this.context = context;
        eventIdStore = (WindowStore<E, Long>) context.getStateStore(storeName);
    }

    @Override
    public KeyValue<K, V> transform(K key, V value) {
        final E eventId = idExtractor.apply(key, value);
        if (eventId == null) {
            // Without an ID we cannot tell whether this is a duplicate, so forward it.
            return KeyValue.pair(key, value);
        } else {
            final KeyValue<K, V> output;
            if (isDuplicate(eventId)) {
                // Returning null drops the record.
                output = null;
                updateTimestampOfExistingEventToPreventExpiry(eventId, context.timestamp());
            } else {
                output = KeyValue.pair(key, value);
                rememberNewEvent(eventId, context.timestamp());
            }
            return output;
        }
    }

    private boolean isDuplicate(final E eventId) {
        final long eventTime = context.timestamp();
        final WindowStoreIterator<Long> timeIterator = eventIdStore.fetch(
                eventId,
                eventTime - leftDurationMs,
                eventTime + rightDurationMs);
        final boolean isDuplicate = timeIterator.hasNext();
        timeIterator.close();
        return isDuplicate;
    }

    private void updateTimestampOfExistingEventToPreventExpiry(final E eventId, final long newTimestamp) {
        eventIdStore.put(eventId, newTimestamp, newTimestamp);
    }

    private void rememberNewEvent(final E eventId, final long timestamp) {
        eventIdStore.put(eventId, timestamp, timestamp);
    }

    @Override
    public void close() {
    }
}
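The behaviour of the transformer can be tried out without Kafka. The sketch below is a simplified, in-memory model of the same logic: a HashMap of timestamps stands in for the window store, string IDs stand in for the generic event ID, and expiry of old entries is ignored. The class name WindowDedupSketch and its accept method are ours.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class WindowDedupSketch {
    // Stand-in for the window store: every timestamp at which an ID was recorded.
    private final Map<String, List<Long>> seen = new HashMap<>();
    private final long leftMs;
    private final long rightMs;

    WindowDedupSketch(long maintainDurationMs) {
        if (maintainDurationMs < 1) {
            throw new IllegalArgumentException("maintain duration per event must be >= 1");
        }
        leftMs = maintainDurationMs / 2;
        rightMs = maintainDurationMs - leftMs;
    }

    // Returns true if the event would be forwarded, false if it would be dropped as a duplicate.
    boolean accept(String eventId, long timestampMs) {
        List<Long> times = seen.computeIfAbsent(eventId, k -> new ArrayList<>());
        boolean duplicate = false;
        for (long t : times) {
            // Same range check as the store lookup: [timestamp - left, timestamp + right].
            if (t >= timestampMs - leftMs && t <= timestampMs + rightMs) {
                duplicate = true;
                break;
            }
        }
        // Both branches of transform() record the ID at the current timestamp.
        times.add(timestampMs);
        return !duplicate;
    }

    public static void main(String[] args) {
        WindowDedupSketch dedup = new WindowDedupSketch(120_000L);
        System.out.println(dedup.accept("10.0.0.1", 0L));       // prints true
        System.out.println(dedup.accept("10.0.0.1", 50_000L));  // prints false
        System.out.println(dedup.accept("10.0.0.1", 100_000L)); // prints false
        System.out.println(dedup.accept("10.0.0.1", 400_000L)); // prints true
    }
}
```

The third call is the interesting one: the event at 100 seconds is more than 60 seconds away from the first event, but it is still dropped because the duplicate at 50 seconds refreshed the stored timestamp. A steady stream of duplicates therefore keeps an ID suppressed for as long as the gaps between them stay inside the window.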
Then create the FindDistinctEvents.java file in src/main/java/huxihx/kafkastreams:
package huxihx.kafkastreams;

import huxihx.kafkastreams.proto.ClickOuterClass;
import huxihx.kafkastreams.serdes.ProtobufSerdes;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.NewTopic;
import org.apache.kafka.clients.admin.TopicListing;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.Topology;
import org.apache.kafka.streams.kstream.Consumed;
import org.apache.kafka.streams.kstream.Produced;
import org.apache.kafka.streams.state.StoreBuilder;
import org.apache.kafka.streams.state.Stores;
import org.apache.kafka.streams.state.WindowStore;

import java.io.FileInputStream;
import java.io.IOException;
import java.time.Duration;
import java.util.ArrayList;
import java.util.Collection;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Properties;
import java.util.concurrent.CountDownLatch;
import java.util.stream.Collectors;

public class FindDistinctEvents {

    private static final String storeName = "eventId-store";

    public static void main(String[] args) throws Exception {
        if (args.length < 1) {
            throw new IllegalArgumentException("Config file path must be specified.");
        }

        FindDistinctEvents app = new FindDistinctEvents();
        Properties envProps = app.loadEnvProperties(args[0]);
        Properties streamProps = app.createStreamsProperties(envProps);
        Topology topology = app.buildTopology(envProps);

        app.preCreateTopics(envProps);

        final KafkaStreams streams = new KafkaStreams(topology, streamProps);
        final CountDownLatch latch = new CountDownLatch(1);

        // Close the streams instance cleanly when the JVM shuts down.
        Runtime.getRuntime().addShutdownHook(new Thread("streams-shutdown-hook") {
            @Override
            public void run() {
                streams.close();
                latch.countDown();
            }
        });

        try {
            streams.start();
            latch.await();
        } catch (Exception e) {
            System.exit(1);
        }
        System.exit(0);
    }

    private Topology buildTopology(Properties envProps) {
        final StreamsBuilder builder = new StreamsBuilder();
        final ProtobufSerdes<ClickOuterClass.Click> clickSerdes = clickProtobufSerdes();

        final String inputTopic = envProps.getProperty("input.topic.name");
        final String outputTopic = envProps.getProperty("output.topic.name");

        final Duration windowSize = Duration.ofMinutes(2);

        // Window store that remembers which event IDs have been seen recently.
        final StoreBuilder<WindowStore<String, Long>> dedupStoreBuilder = Stores.windowStoreBuilder(
                Stores.persistentWindowStore(storeName, windowSize, windowSize, false),
                Serdes.String(),
                Serdes.Long());

        builder.addStateStore(dedupStoreBuilder);

        // De-duplicate by IP: the IP of each click is used as the event ID.
        builder.stream(inputTopic, Consumed.with(Serdes.String(), clickSerdes))
                .transform(() -> new DeduplicationTransformer<>(windowSize.toMillis(), (key, value) -> value.getIp()), storeName)
                .to(outputTopic, Produced.with(Serdes.String(), clickSerdes));

        return builder.build();
    }

    private Properties createStreamsProperties(Properties envProps) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, envProps.getProperty("application.id"));
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, envProps.getProperty("bootstrap.servers"));
        props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        return props;
    }

    // Creates the input and output topics if they do not exist yet.
    private void preCreateTopics(Properties envProps) throws Exception {
        Map<String, Object> config = new HashMap<>();
        config.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, envProps.getProperty("bootstrap.servers"));
        String inputTopic = envProps.getProperty("input.topic.name");
        String outputTopic = envProps.getProperty("output.topic.name");
        try (AdminClient client = AdminClient.create(config)) {
            Collection<TopicListing> existingTopics = client.listTopics().listings().get();

            List<NewTopic> topics = new ArrayList<>();
            List<String> topicNames = existingTopics.stream().map(TopicListing::name).collect(Collectors.toList());
            if (!topicNames.contains(inputTopic))
                topics.add(new NewTopic(
                        inputTopic,
                        Integer.parseInt(envProps.getProperty("input.topic.partitions")),
                        Short.parseShort(envProps.getProperty("input.topic.replication.factor"))));

            if (!topicNames.contains(outputTopic))
                topics.add(new NewTopic(
                        outputTopic,
                        Integer.parseInt(envProps.getProperty("output.topic.partitions")),
                        Short.parseShort(envProps.getProperty("output.topic.replication.factor"))));

            if (!topics.isEmpty())
                client.createTopics(topics).all().get();
        }
    }

    private Properties loadEnvProperties(String fileName) throws IOException {
        Properties envProps = new Properties();
        try (FileInputStream input = new FileInputStream(fileName)) {
            envProps.load(input);
        }
        return envProps;
    }

    private static ProtobufSerdes<ClickOuterClass.Click> clickProtobufSerdes() {
        return new ProtobufSerdes<>(ClickOuterClass.Click.parser());
    }
}
The core logic is in the buildTopology method, where a custom DeduplicationTransformer implements de-duplication over a 2-minute window.
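To make the 2-minute window concrete: the transformer splits the configured duration into a left and a right half and looks for the same ID in that range around the current record's timestamp. The sketch below reproduces only this arithmetic in plain Java; the class name WindowBoundsSketch and the lookupRange method are ours. Note that the time used is the record timestamp returned by context.timestamp(), not the timestamp string inside the Click message.

```java
import java.time.Duration;

public class WindowBoundsSketch {
    // Returns {from, to}: the time range (in ms) searched for an event at eventTimeMs.
    static long[] lookupRange(Duration window, long eventTimeMs) {
        long left = window.toMillis() / 2;        // integer division, rounds down
        long right = window.toMillis() - left;    // the right half takes the remainder
        return new long[] { eventTimeMs - left, eventTimeMs + right };
    }

    public static void main(String[] args) {
        long[] range = lookupRange(Duration.ofMinutes(2), 300_000L);
        System.out.println(range[0] + " .. " + range[1]); // prints 240000 .. 360000
    }
}
```

So with windowSize set to 2 minutes, a click is treated as a duplicate if the same IP was recorded within one minute before or after its record timestamp.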
7. Write the Test Producer and Consumer
As in the earlier articles in this series, we write TestProducer and TestConsumer classes. Create TestProducer.java and TestConsumer.java under src/main/java/huxihx/kafkastreams/tests with the following contents:
TestProducer.java:
package huxihx.kafkastreams.tests;

import huxihx.kafkastreams.proto.ClickOuterClass;
import huxihx.kafkastreams.serdes.ProtobufSerializer;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.Producer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;

import java.util.Arrays;
import java.util.List;
import java.util.Properties;

public class TestProducer {

    // Six events from three distinct IPs: the last three repeat the first three.
    private static final List<ClickOuterClass.Click> TEST_CLICK_EVENTS = Arrays.asList(
            ClickOuterClass.Click.newBuilder().setIp("10.0.0.1")
                    .setUrl("https://docs.confluent.io/current/tutorials/examples/kubernetes/gke-base/docs/index.html")
                    .setTimestamp("2019-09-16T14:53:43+00:00").build(),
            ClickOuterClass.Click.newBuilder().setIp("10.0.0.2")
                    .setUrl("https://www.confluent.io/hub/confluentinc/kafka-connect-datagen")
                    .setTimestamp("2019-09-16T14:53:43+00:01").build(),
            ClickOuterClass.Click.newBuilder().setIp("10.0.0.3")
                    .setUrl("https://www.confluent.io/hub/confluentinc/kafka-connect-datagen")
                    .setTimestamp("2019-09-16T14:53:43+00:03").build(),
            ClickOuterClass.Click.newBuilder().setIp("10.0.0.1")
                    .setUrl("https://docs.confluent.io/current/tutorials/examples/kubernetes/gke-base/docs/index.html")
                    .setTimestamp("2019-09-16T14:53:43+00:00").build(),
            ClickOuterClass.Click.newBuilder().setIp("10.0.0.2")
                    .setUrl("https://www.confluent.io/hub/confluentinc/kafka-connect-datagen")
                    .setTimestamp("2019-09-16T14:53:43+00:01").build(),
            ClickOuterClass.Click.newBuilder().setIp("10.0.0.3")
                    .setUrl("https://www.confluent.io/hub/confluentinc/kafka-connect-datagen")
                    .setTimestamp("2019-09-16T14:53:43+00:03").build()
    );

    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(ProducerConfig.ACKS_CONFIG, "all");
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, "org.apache.kafka.common.serialization.StringSerializer");
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, new ProtobufSerializer<ClickOuterClass.Click>().getClass());

        try (final Producer<String, ClickOuterClass.Click> producer = new KafkaProducer<>(props)) {
            TEST_CLICK_EVENTS.stream()
                    .map(click -> new ProducerRecord<String, ClickOuterClass.Click>("clicks", click))
                    .forEach(producer::send);
        }
    }
}
TestConsumer.java:
package huxihx.kafkastreams.tests;

import com.google.protobuf.Parser;
import huxihx.kafkastreams.proto.ClickOuterClass;
import huxihx.kafkastreams.serdes.ProtobufDeserializer;
import org.apache.kafka.clients.consumer.Consumer;
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.Deserializer;
import org.apache.kafka.common.serialization.StringDeserializer;

import java.time.Duration;
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;
import java.util.Properties;

public class TestConsumer {

    public static void main(String[] args) {
        // Configure the protobuf deserializer with the Click parser.
        Deserializer<ClickOuterClass.Click> deserializer = new ProtobufDeserializer<>();
        Map<String, Parser<ClickOuterClass.Click>> config = new HashMap<>();
        config.put("parser", ClickOuterClass.Click.parser());
        deserializer.configure(config, false);

        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "test-group01");
        props.put(ConsumerConfig.AUTO_COMMIT_INTERVAL_MS_CONFIG, "1000");
        props.put(ConsumerConfig.AUTO_OFFSET_RESET_CONFIG, "earliest");

        try (final Consumer<String, ClickOuterClass.Click> consumer = new KafkaConsumer<>(props, new StringDeserializer(), deserializer)) {
            consumer.subscribe(Arrays.asList("distinct-clicks"));
            while (true) {
                ConsumerRecords<String, ClickOuterClass.Click> records = consumer.poll(Duration.ofMillis(1000));
                for (ConsumerRecord<String, ClickOuterClass.Click> record : records) {
                    System.out.printf("offset = %d, key = %s, value = %s%n", record.offset(), record.key(), record.value());
                }
            }
        }
    }
}
8. Test
First, build the project by running the following command:
$ ./gradlew shadowJar
Then start the Kafka cluster and run the Kafka Streams application:
$ java -jar build/libs/kstreams-transform-standalone-0.0.1.jar configuration/dev.properties
Now open two terminals and run the test Producer and Consumer in them:
$ java -cp build/libs/kstreams-transform-standalone-0.0.1.jar huxihx.kafkastreams.tests.TestProducer
$ java -cp build/libs/kstreams-transform-standalone-0.0.1.jar huxihx.kafkastreams.tests.TestConsumer
If all goes well, then TestConsumer should output three messages:
offset = 0, key = null, value = ip: "10.0.0.1"
url: "https://docs.confluent.io/current/tutorials/examples/kubernetes/gke-base/docs/index.html"
timestamp: "2019-09-16T14:53:43+00:00"
offset = 1, key = null, value = ip: "10.0.0.2"
url: "https://www.confluent.io/hub/confluentinc/kafka-connect-datagen"
timestamp: "2019-09-16T14:53:43+00:01"
offset = 2, key = null, value = ip: "10.0.0.3"
url: "https://www.confluent.io/hub/confluentinc/kafka-connect-datagen"
timestamp: "2019-09-16T14:53:43+00:03"
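The count of three follows from the test data: TestProducer sends six events that carry only three distinct IPs, and all six arrive well within the 2-minute window. The Kafka-free sketch below reproduces that reasoning by keeping the first event per IP; the class name ExpectedOutputSketch and its method are ours, and it assumes all events fall inside a single de-duplication window.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class ExpectedOutputSketch {
    // Keeps the first occurrence of each IP, in arrival order.
    static List<String> distinctIps(List<String> ips) {
        Set<String> seen = new HashSet<>();
        List<String> out = new ArrayList<>();
        for (String ip : ips) {
            if (seen.add(ip)) { // add() returns false if the IP was already present
                out.add(ip);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // The IPs of the six test events, in the order TestProducer sends them.
        List<String> sent = Arrays.asList(
            "10.0.0.1", "10.0.0.2", "10.0.0.3",
            "10.0.0.1", "10.0.0.2", "10.0.0.3");
        System.out.println(distinctIps(sent)); // prints [10.0.0.1, 10.0.0.2, 10.0.0.3]
    }
}
```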