Getting Started with Kafka Streams Development (6)

1. Background

The previous article introduced the merge operator. This article shows how to filter recurring events out of a Kafka Streams stream, keeping only the unique events.

2. Demo Description

Suppose we want to de-duplicate events in the following format:

{"ip":"10.0.0.1","url":"https://docs.confluent.io/current/tutorials/examples/kubernetes/gke-base/docs/index.html","timestamp":"2019-09-16T14:53:43+00:00"}

As before, each event is serialized with Protocol Buffers and consists of three fields: ip, url and timestamp.

3. Project Configuration

First, create the project directory:

$ mkdir distinct-events && cd distinct-events

Then create the Gradle build file build.gradle in the distinct-events directory, as follows:

buildscript {
  
    repositories {
        jcenter()
    }
    dependencies {
        classpath 'com.github.jengelman.gradle.plugins:shadow:4.0.2'
    }
}
  
plugins {
    id 'java'
    id "com.google.protobuf" version "0.8.10"
}
apply plugin: 'com.github.johnrengelman.shadow'
  
  
repositories {
    mavenCentral()
    jcenter()
  
    maven {
        url 'http://packages.confluent.io/maven'
    }
}
  
group 'huxihx.kafkastreams'
  
sourceCompatibility = 1.8
targetCompatibility = '1.8'
version = '0.0.1'
  
dependencies {
    implementation 'com.google.protobuf:protobuf-java:3.0.0'
    implementation 'org.slf4j:slf4j-simple:1.7.26'
    implementation 'org.apache.kafka:kafka-streams:2.3.0'
    implementation 'com.google.protobuf:protobuf-java:3.9.1'
  
    testCompile group: 'junit', name: 'junit', version: '4.12'
}
  
protobuf {
    generatedFilesBaseDir = "$projectDir/src/"
    protoc {
        artifact = 'com.google.protobuf:protoc:3.0.0'
    }
}
  
jar {
    manifest {
        attributes(
                'Class-Path': configurations.compile.collect { it.getName() }.join(' '),
                'Main-Class': 'huxihx.kafkastreams.FindDistinctEvents'
        )
    }
}
  
shadowJar {
    archiveName = "kstreams-transform-standalone-${version}.${extension}"
}

Note that the main class is set to huxihx.kafkastreams.FindDistinctEvents.

Save the above file, and then execute the following command to generate the Gradle wrapper:

$ gradle wrapper

Once this is done, create a subdirectory named configuration in the distinct-events directory to hold the configuration file dev.properties:

$ mkdir configuration

The content of configuration/dev.properties is as follows:

application.id=find-distinct-app
bootstrap.servers=localhost:9092

input.topic.name=clicks
input.topic.partitions=1
input.topic.replication.factor=1

output.topic.name=distinct-clicks
output.topic.partitions=1
output.topic.replication.factor=1

Here we configure one input topic and one output topic, which store the incoming message stream and the de-duplicated message stream respectively.
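As a rough illustration of how these settings are consumed later, the sketch below loads the same key/value pairs with java.util.Properties and converts the numeric ones the way preCreateTopics does. The class name DevPropertiesSketch and the inlined SAMPLE string are assumptions made only to keep the example self-contained; the real application reads the file from disk instead.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Properties;

// Hypothetical helper, not part of the project: parses dev.properties-style text.
class DevPropertiesSketch {

    // Same key/value pairs as dev.properties, inlined so the example needs no file.
    static final String SAMPLE =
            "application.id=find-distinct-app\n"
            + "bootstrap.servers=localhost:9092\n"
            + "\n"
            + "input.topic.name=clicks\n"
            + "input.topic.partitions=1\n"
            + "input.topic.replication.factor=1\n"
            + "\n"
            + "output.topic.name=distinct-clicks\n"
            + "output.topic.partitions=1\n"
            + "output.topic.replication.factor=1\n";

    // Loads the text into a Properties object; blank lines are ignored by the loader.
    static Properties parse(String text) {
        Properties props = new Properties();
        try {
            props.load(new StringReader(text));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return props;
    }

    public static void main(String[] args) {
        Properties props = parse(SAMPLE);
        // Every value is stored as a String, so numeric settings must be parsed explicitly.
        int partitions = Integer.parseInt(props.getProperty("input.topic.partitions"));
        short replication = Short.parseShort(props.getProperty("input.topic.replication.factor"));
        System.out.println(props.getProperty("input.topic.name")
                + " -> " + props.getProperty("output.topic.name"));
        System.out.println("partitions=" + partitions + ", replication=" + replication);
    }
}
```

Because Properties stores every value as a string, a typo in a numeric setting only surfaces as a NumberFormatException when the value is parsed.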

4. Create Message Schema

Next, create the schema used by the topics. Run the following command in the distinct-events directory to create a folder for the schema files:

$ mkdir -p src/main/proto

Then create a file named click.proto in the proto folder, with the following content:

syntax = "proto3";
  
package huxihx.kafkastreams.proto;
  
message Click {
    string ip = 1;
    string url = 2;
    string timestamp = 3;
}

After saving the file, run the gradlew command in the distinct-events directory:

$ ./gradlew build  

At this point, you should see the generated Java class ClickOuterClass under distinct-events/src/main/java/huxihx/kafkastreams/proto.

5. Create Serdes

In this step we create the Serdes needed for the topic messages. First, execute the following command in the distinct-events directory to create the corresponding folder:

$ mkdir -p src/main/java/huxihx/kafkastreams/serdes

Then create ProtobufSerializer.java in the newly created serdes folder:

package huxihx.kafkastreams.serdes;
  
import com.google.protobuf.MessageLite;
import org.apache.kafka.common.serialization.Serializer;
  
public class ProtobufSerializer<T extends MessageLite> implements Serializer<T> {
    @Override
    public byte[] serialize(String topic, T data) {
        return data == null ? new byte[0] : data.toByteArray();
    }
}

Then ProtobufDeserializer.java:

package huxihx.kafkastreams.serdes;
  
import com.google.protobuf.InvalidProtocolBufferException;
import com.google.protobuf.MessageLite;
import com.google.protobuf.Parser;
import org.apache.kafka.common.errors.SerializationException;
import org.apache.kafka.common.serialization.Deserializer;
  
import java.util.Map;
  
public class ProtobufDeserializer<T extends MessageLite> implements Deserializer<T> {
  
    private Parser<T> parser;
  
    @Override
    public void configure(Map<String, ?> configs, boolean isKey) {
        parser = (Parser<T>) configs.get("parser");
    }
  
    @Override
    public T deserialize(String topic, byte[] data) {
        try {
            return parser.parseFrom(data);
        } catch (InvalidProtocolBufferException e) {
            throw new SerializationException("Failed to deserialize from a protobuf byte array.", e);
        }
    }
}

Finally ProtobufSerdes.java:  

package huxihx.kafkastreams.serdes;
  
import com.google.protobuf.MessageLite;
import com.google.protobuf.Parser;
import org.apache.kafka.common.serialization.Deserializer;
import org.apache.kafka.common.serialization.Serde;
import org.apache.kafka.common.serialization.Serializer;
  
import java.util.HashMap;
import java.util.Map;
  
public class ProtobufSerdes<T extends MessageLite> implements Serde<T> {
  
    private final Serializer<T> serializer;
    private final Deserializer<T> deserializer;
  
    public ProtobufSerdes(Parser<T> parser) {
        serializer = new ProtobufSerializer<>();
        deserializer = new ProtobufDeserializer<>();
        Map<String, Parser<T>> config = new HashMap<>();
        config.put("parser", parser);
        deserializer.configure(config, false);
    }
  
    @Override
    public Serializer<T> serializer() {
        return serializer;
    }
  
    @Override
    public Deserializer<T> deserializer() {
        return deserializer;
    }
}
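The pairing above can be sanity-checked with a round trip: whatever the serializer writes, the deserializer should read back unchanged. The sketch below is only a stand-in that uses UTF-8 strings instead of protobuf messages and plain functions instead of the Kafka interfaces; the class name SerdeRoundTripSketch and the utf8() factory are hypothetical. It copies the null handling of ProtobufSerializer (null becomes an empty byte array) to show one consequence of that convention.

```java
import java.nio.charset.StandardCharsets;
import java.util.function.Function;

// Stand-in sketch of a serializer/deserializer pair; it does not use Kafka or protobuf.
class SerdeRoundTripSketch<T> {
    private final Function<T, byte[]> encoder;
    private final Function<byte[], T> decoder;

    SerdeRoundTripSketch(Function<T, byte[]> encoder, Function<byte[], T> decoder) {
        this.encoder = encoder;
        this.decoder = decoder;
    }

    // Mirrors ProtobufSerializer: a null value is written as an empty byte array.
    byte[] serialize(T data) {
        return data == null ? new byte[0] : encoder.apply(data);
    }

    // Decodes whatever bytes it is given, including the empty array.
    T deserialize(byte[] bytes) {
        return decoder.apply(bytes);
    }

    // Hypothetical factory: a pair that encodes Strings as UTF-8 bytes.
    static SerdeRoundTripSketch<String> utf8() {
        return new SerdeRoundTripSketch<>(
                s -> s.getBytes(StandardCharsets.UTF_8),
                b -> new String(b, StandardCharsets.UTF_8));
    }

    public static void main(String[] args) {
        SerdeRoundTripSketch<String> serde = utf8();
        // A normal value survives the round trip unchanged.
        System.out.println(serde.deserialize(serde.serialize("10.0.0.1")));
        // A null value comes back as an empty string, not as null.
        System.out.println(serde.deserialize(serde.serialize(null)).isEmpty());
    }
}
```

In this sketch a null input does not come back as null. If null values ever need to be distinguished from empty messages, the null case would have to be handled explicitly on both sides.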

6. Develop the main flow

First create DeduplicationTransformer.java in src/main/java/huxihx/kafkastreams. This Java class implements the de-duplication logic:

package huxihx.kafkastreams;

import org.apache.kafka.streams.KeyValue;
import org.apache.kafka.streams.kstream.KeyValueMapper;
import org.apache.kafka.streams.kstream.Transformer;
import org.apache.kafka.streams.processor.ProcessorContext;
import org.apache.kafka.streams.state.WindowStore;
import org.apache.kafka.streams.state.WindowStoreIterator;

/**
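 * Performs the de-duplication logic; in this article events are de-duplicated by IP address.
 * @param <K> key type of the input records
 * @param <V> value type of the input records
 * @param <E> type of the event ID extracted from each record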
 */
public class DeduplicationTransformer<K, V, E> implements Transformer<K, V, KeyValue<K, V>> {

    private static final String storeName = "eventId-store";

    private ProcessorContext context;
    private WindowStore<E, Long> eventIdStore;

    private final long leftDurationMs;
    private final long rightDurationMs;

    private final KeyValueMapper<K, V, E> idExtractor;

    DeduplicationTransformer(final long maintainDurationPerEventInMs, final KeyValueMapper<K, V, E> idExtractor) {
        if (maintainDurationPerEventInMs < 1) {
            throw new IllegalArgumentException("maintain duration per event must be >= 1");
        }

        leftDurationMs = maintainDurationPerEventInMs / 2;
        rightDurationMs = maintainDurationPerEventInMs - leftDurationMs;
        this.idExtractor = idExtractor;
    }

    @Override
    public void init(ProcessorContext context) {
        this.context = context;
        eventIdStore = (WindowStore<E, Long>) context.getStateStore(storeName);
    }

    @Override
    public KeyValue<K, V> transform(K key, V value) {
        final E eventId = idExtractor.apply(key, value);
        if (eventId == null) {
            return KeyValue.pair(key, value);
        } else {
            final KeyValue<K, V> output;
            if (isDuplicate(eventId)) {
                output = null;
                updateTimestampOfExistingEventToPreventExpiry(eventId, context.timestamp());
            } else {
                output = KeyValue.pair(key, value);
                rememberNewEvent(eventId, context.timestamp());
            }
            return output;
        }
    }

    private boolean isDuplicate(final E eventId) {
        final long eventTime = context.timestamp();
        final WindowStoreIterator<Long> timeIterator = eventIdStore.fetch(
                eventId, eventTime - leftDurationMs, eventTime + rightDurationMs);
        final boolean isDuplicate = timeIterator.hasNext();
        timeIterator.close();
        return isDuplicate;
    }

    private void updateTimestampOfExistingEventToPreventExpiry(final E eventId, final long newTimestamp) {
        eventIdStore.put(eventId, newTimestamp, newTimestamp);
    }

    private void rememberNewEvent(final E eventId, final long timestamp) {
        eventIdStore.put(eventId, timestamp, timestamp);
    }

    @Override
    public void close() {

    }
}

Then create the FindDistinctEvents.java file in src/main/java/huxihx/kafkastreams:

package huxihx.kafkastreams;

import huxihx.kafkastreams.proto.ClickOuterClass;
import huxihx.kafkastreams.serdes.ProtobufSerdes;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.NewTopic;
import org.apache.kafka.clients.admin.TopicListing;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.Topology;
import org.apache.kafka.streams.kstream.Consumed;
import org.apache.kafka.streams.kstream.Produced;
import org.apache.kafka.streams.state.StoreBuilder;
import org.apache.kafka.streams.state.Stores;
import org.apache.kafka.streams.state.WindowStore;

import java.io.FileInputStream;
import java.io.IOException;
import java.time.Duration;
import java.util.ArrayList;
import java.util.Collection;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Properties;
import java.util.concurrent.CountDownLatch;
import java.util.stream.Collectors;

public class FindDistinctEvents {

    private static final String storeName = "eventId-store";

    public static void main(String[] args) throws Exception {
        if (args.length < 1) {
            throw new IllegalArgumentException("Config file path must be specified.");
        }

        FindDistinctEvents app = new FindDistinctEvents();
        Properties envProps = app.loadEnvProperties(args[0]);
        Properties streamProps = app.createStreamsProperties(envProps);
        Topology topology = app.buildTopology(envProps);

        app.preCreateTopics(envProps);

        final KafkaStreams streams = new KafkaStreams(topology, streamProps);
        final CountDownLatch latch = new CountDownLatch(1);

        Runtime.getRuntime().addShutdownHook(new Thread("streams-shutdown-hook") {
            @Override
            public void run() {
                streams.close();
                latch.countDown();
            }
        });

        try {
            streams.start();
            latch.await();
        } catch (Exception e) {
            System.exit(1);
        }
        System.exit(0);
    }

    private Topology buildTopology(Properties envProps) {
        final StreamsBuilder builder = new StreamsBuilder();
        final ProtobufSerdes<ClickOuterClass.Click> clickSerdes = clickProtobufSerdes();

        final String inputTopic = envProps.getProperty("input.topic.name");
        final String outputTopic = envProps.getProperty("output.topic.name");
        final Duration windowSize = Duration.ofMinutes(2);

        final StoreBuilder<WindowStore<String, Long>> dedupStoreBuilder = Stores.windowStoreBuilder(
                Stores.persistentWindowStore(storeName,
                        windowSize,
                        windowSize,
                        false
                ),
                Serdes.String(),
                Serdes.Long());

        builder.addStateStore(dedupStoreBuilder);
        builder.stream(inputTopic, Consumed.with(Serdes.String(), clickSerdes))
                .transform(() -> new DeduplicationTransformer<>(windowSize.toMillis(), (key, value) -> value.getIp()), storeName)
                .to(outputTopic, Produced.with(Serdes.String(), clickSerdes));
        return builder.build();
    }

    private Properties createStreamsProperties(Properties envProps) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, envProps.getProperty("application.id"));
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, envProps.getProperty("bootstrap.servers"));
        props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        return props;
    }

    private void preCreateTopics(Properties envProps) throws Exception {
        Map<String, Object> config = new HashMap<>();
        config.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, envProps.getProperty("bootstrap.servers"));
        String inputTopic = envProps.getProperty("input.topic.name");
        String outputTopic = envProps.getProperty("output.topic.name");
        try (AdminClient client = AdminClient.create(config)) {
            Collection<TopicListing> existingTopics = client.listTopics().listings().get();

            List<NewTopic> topics = new ArrayList<>();
            List<String> topicNames = existingTopics.stream().map(TopicListing::name).collect(Collectors.toList());
            if (!topicNames.contains(inputTopic))
                topics.add(new NewTopic(
                        inputTopic,
                        Integer.parseInt(envProps.getProperty("input.topic.partitions")),
                        Short.parseShort(envProps.getProperty("input.topic.replication.factor"))));

            if (!topicNames.contains(outputTopic))
                topics.add(new NewTopic(
                        outputTopic,
                        Integer.parseInt(envProps.getProperty("output.topic.partitions")),
                        Short.parseShort(envProps.getProperty("output.topic.replication.factor"))));

            if (!topics.isEmpty())
                client.createTopics(topics).all().get();
        }
    }

    private Properties loadEnvProperties(String fileName) throws IOException {
        Properties envProps = new Properties();
        try (FileInputStream input = new FileInputStream(fileName)) {
            envProps.load(input);
        }
        return envProps;
    }

    private static ProtobufSerdes<ClickOuterClass.Click> clickProtobufSerdes() {
        return new ProtobufSerdes<>(ClickOuterClass.Click.parser());
    }
}

The main logic lives in the buildTopology method, where we use the custom DeduplicationTransformer to implement de-duplication over a 2-minute window.
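To make the window check easier to follow, here is a minimal in-memory sketch of the same idea. It is not the Kafka Streams implementation: the class DedupWindowSketch and its methods are hypothetical, a map of sorted timestamps stands in for the window store, and old entries are never expired. It reuses the left/right split from DeduplicationTransformer and treats both ends of the lookup range as inclusive.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeSet;

// In-memory stand-in for the window-store lookup; it does not use the Kafka Streams API
// and, unlike a real window store, never drops old entries.
class DedupWindowSketch {
    private final long leftMs;
    private final long rightMs;
    // event ID -> every timestamp at which that ID has been seen
    private final Map<String, TreeSet<Long>> seen = new HashMap<>();

    DedupWindowSketch(long maintainDurationMs) {
        if (maintainDurationMs < 1) {
            throw new IllegalArgumentException("maintain duration per event must be >= 1");
        }
        // Same split as DeduplicationTransformer: half before, the rest after.
        leftMs = maintainDurationMs / 2;
        rightMs = maintainDurationMs - leftMs;
    }

    // Returns true when the event should be forwarded, false when it is a duplicate.
    boolean offer(String eventId, long timestampMs) {
        TreeSet<Long> timestamps = seen.computeIfAbsent(eventId, id -> new TreeSet<>());
        // Duplicate if the ID was already seen in [timestamp - left, timestamp + right].
        boolean duplicate = !timestamps
                .subSet(timestampMs - leftMs, true, timestampMs + rightMs, true)
                .isEmpty();
        // Record the sighting either way, as the transformer does.
        timestamps.add(timestampMs);
        return !duplicate;
    }

    // Feeds the IDs through a fresh sketch, one event every stepMs milliseconds.
    static List<String> distinct(List<String> eventIds, long stepMs, long maintainDurationMs) {
        DedupWindowSketch dedup = new DedupWindowSketch(maintainDurationMs);
        List<String> forwarded = new ArrayList<>();
        long now = 0L;
        for (String id : eventIds) {
            if (dedup.offer(id, now)) {
                forwarded.add(id);
            }
            now += stepMs;
        }
        return forwarded;
    }

    public static void main(String[] args) {
        List<String> ips = Arrays.asList(
                "10.0.0.1", "10.0.0.2", "10.0.0.3", "10.0.0.1", "10.0.0.2", "10.0.0.3");
        // One event per second with a 2-minute setting: prints [10.0.0.1, 10.0.0.2, 10.0.0.3]
        System.out.println(distinct(ips, 1_000L, 120_000L));
    }
}
```

Note that because the duration is split in half, the sketch only looks back one minute for a 2-minute setting. A repeat of the same ID arriving more than a minute after the most recent sighting is forwarded again.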

7. Writing test Producer and Consumer

As in the previous articles of this series, we write TestProducer and TestConsumer classes. Create TestProducer.java and TestConsumer.java under src/main/java/huxihx/kafkastreams/tests, with the following contents:

TestProducer.java:

package huxihx.kafkastreams.tests;

import huxihx.kafkastreams.proto.ClickOuterClass;
import huxihx.kafkastreams.serdes.ProtobufSerializer;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.Producer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;

import java.util.Arrays;
import java.util.List;
import java.util.Properties;

public class TestProducer {
    private static final List<ClickOuterClass.Click> TEST_CLICK_EVENTS = Arrays.asList(
            ClickOuterClass.Click.newBuilder().setIp("10.0.0.1")
                    .setUrl("https://docs.confluent.io/current/tutorials/examples/kubernetes/gke-base/docs/index.html")
                    .setTimestamp("2019-09-16T14:53:43+00:00").build(),
            ClickOuterClass.Click.newBuilder().setIp("10.0.0.2")
                    .setUrl("https://www.confluent.io/hub/confluentinc/kafka-connect-datagen")
                    .setTimestamp("2019-09-16T14:53:43+00:01").build(),
            ClickOuterClass.Click.newBuilder().setIp("10.0.0.3")
                    .setUrl("https://www.confluent.io/hub/confluentinc/kafka-connect-datagen")
                    .setTimestamp("2019-09-16T14:53:43+00:03").build(),
            ClickOuterClass.Click.newBuilder().setIp("10.0.0.1")
                    .setUrl("https://docs.confluent.io/current/tutorials/examples/kubernetes/gke-base/docs/index.html")
                    .setTimestamp("2019-09-16T14:53:43+00:00").build(),
            ClickOuterClass.Click.newBuilder().setIp("10.0.0.2")
                    .setUrl("https://www.confluent.io/hub/confluentinc/kafka-connect-datagen")
                    .setTimestamp("2019-09-16T14:53:43+00:01").build(),
            ClickOuterClass.Click.newBuilder().setIp("10.0.0.3")
                    .setUrl("https://www.confluent.io/hub/confluentinc/kafka-connect-datagen")
                    .setTimestamp("2019-09-16T14:53:43+00:03").build()
    );

    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(ProducerConfig.ACKS_CONFIG, "all");
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, "org.apache.kafka.common.serialization.StringSerializer");
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, new ProtobufSerializer<ClickOuterClass.Click>().getClass());

        try (final Producer<String, ClickOuterClass.Click> producer = new KafkaProducer<>(props)) {
            TEST_CLICK_EVENTS.stream().map(click -> new ProducerRecord<String, ClickOuterClass.Click>("clicks", click)).forEach(producer::send);
        }
    }
}

TestConsumer.java:

package huxihx.kafkastreams.tests;

import com.google.protobuf.Parser;
import huxihx.kafkastreams.proto.ClickOuterClass;
import huxihx.kafkastreams.serdes.ProtobufDeserializer;
import org.apache.kafka.clients.consumer.Consumer;
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.Deserializer;
import org.apache.kafka.common.serialization.StringDeserializer;

import java.time.Duration;
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;
import java.util.Properties;

public class TestConsumer {

    public static void main(String[] args) {
        Deserializer<ClickOuterClass.Click> deserializer = new ProtobufDeserializer<>();
        Map<String, Parser<ClickOuterClass.Click>> config = new HashMap<>();
        config.put("parser", ClickOuterClass.Click.parser());
        deserializer.configure(config, false);

        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "test-group01");
        props.put(ConsumerConfig.AUTO_COMMIT_INTERVAL_MS_CONFIG, "1000");
        props.put(ConsumerConfig.AUTO_OFFSET_RESET_CONFIG, "earliest");

        try (final Consumer<String, ClickOuterClass.Click> consumer = new KafkaConsumer<>(props, new StringDeserializer(), deserializer)) {
            consumer.subscribe(Arrays.asList("distinct-clicks"));
            while (true) {
                ConsumerRecords<String, ClickOuterClass.Click> records = consumer.poll(Duration.ofMillis(1000));
                for (ConsumerRecord<String, ClickOuterClass.Click> record : records) {
                    System.out.printf("offset = %d, key = %s, value = %s%n", record.offset(), record.key(), record.value());
                }
            }
        }
    }
}

8. Test

First, build the project by running the following command:

$ ./gradlew shadowJar  

Then start the Kafka cluster and run the Kafka Streams application:

$ java -jar build/libs/kstreams-transform-standalone-0.0.1.jar configuration/dev.properties

Now open two terminals and run the test Producer and Consumer respectively:

$ java -cp build/libs/kstreams-transform-standalone-0.0.1.jar huxihx.kafkastreams.tests.TestProducer

$ java -cp build/libs/kstreams-transform-standalone-0.0.1.jar huxihx.kafkastreams.tests.TestConsumer

If all goes well, TestConsumer should output three messages:

offset = 0, key = null, value = ip: "10.0.0.1"
url: "https://docs.confluent.io/current/tutorials/examples/kubernetes/gke-base/docs/index.html"
timestamp: "2019-09-16T14:53:43+00:00"

offset = 1, key = null, value = ip: "10.0.0.2"
url: "https://www.confluent.io/hub/confluentinc/kafka-connect-datagen"
timestamp: "2019-09-16T14:53:43+00:01"

offset = 2, key = null, value = ip: "10.0.0.3"
url: "https://www.confluent.io/hub/confluentinc/kafka-connect-datagen"
timestamp: "2019-09-16T14:53:43+00:03"


Origin www.cnblogs.com/huxi2b/p/12154771.html