Kafka Streams development introduction (9)

1. Background

The previous article showed how to use Kafka Streams to compute, in real time, the highest- and lowest-grossing movies of a given year, implementing the max / min operators with the aggregate method provided by Streams. This article shows how to use time windows. In Kafka Streams, there are three types of time windows: fixed time window (Tumbling Window), sliding time window (Sliding Window) and session window (Session Window). We will not discuss the definitions and differences of these three types in detail; instead, we use a project to illustrate how to use a Tumbling Window to periodically count the number of ratings each movie receives.

2. Functional demonstration

In this article we create a Kafka topic that carries movie rating events, then write a program that counts the ratings each movie receives in each time window. We still use Protocol Buffers to serialize the events. Represented as JSON, an event looks like this:

{"title": "Die Hard", "release_year": 1998, "rating": 8.2, "timestamp": "2019-04-25T18:00:00-0700"}

The fields are self-explanatory, so we will not describe them further.

The program counts, in real time, how many ratings each movie receives in each time window. For example, the output looks like this:

[Die Hard@1556186400000/1556187000000]	1
[Die Hard@1556186400000/1556187000000]	2
[Die Hard@1556186400000/1556187000000]	3
[Die Hard@1556186400000/1556187000000]	4
[Die Hard@1556188200000/1556188800000]	1

The output above shows that Die Hard received 4 ratings in the first window and, so far, 1 rating in the second window. Each incoming rating produces a new line with the updated count for its window.
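To make the window boundaries in that output less mysterious, here is a minimal sketch of how a 10-minute tumbling window can be assigned and counted in plain Java. It is not the Kafka Streams implementation; the class and method names (TumblingWindowSketch, windowStart, countPerWindow) are made up for illustration, and the sketch assumes windows are aligned to the epoch and that one update is emitted per event.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class TumblingWindowSketch {

    // Start of the epoch-aligned tumbling window that contains the timestamp.
    public static long windowStart(long timestampMs, long sizeMs) {
        return timestampMs - Math.floorMod(timestampMs, sizeMs);
    }

    // Emits one "[title@start/end]<TAB>count" line per event, i.e. every update is visible.
    public static List<String> countPerWindow(String title, long[] timestampsMs, long sizeMs) {
        Map<Long, Long> counts = new LinkedHashMap<>();
        List<String> updates = new ArrayList<>();
        for (long ts : timestampsMs) {
            long start = windowStart(ts, sizeMs);
            long count = counts.merge(start, 1L, Long::sum);
            updates.add(String.format("[%s@%d/%d]\t%d", title, start, start + sizeMs, count));
        }
        return updates;
    }

    public static void main(String[] args) {
        long size = 10 * 60 * 1000L;   // 10-minute window
        long minute = 60_000L;
        long base = 1556186400000L;    // first Die Hard event as it appears in the sample output
        long[] events = {base, base + 3 * minute, base + 4 * minute, base + 7 * minute, base + 32 * minute};
        countPerWindow("Die Hard", events, size).forEach(System.out::println);
    }
}
```

Running main prints the same five lines as the sample output above: four updates for the window starting at 1556186400000 and one for the window starting at 1556188200000.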

3. Configuration items

The first step is to create the project directory with the following command:

$ mkdir tumbling-windows && cd tumbling-windows

Then create a new Gradle configuration file build.gradle in the newly created tumbling-windows path, as follows:

buildscript {
   
    repositories {
        jcenter()
    }
    dependencies {
        classpath 'com.github.jengelman.gradle.plugins:shadow:4.0.2'
    }
}
   
plugins {
    id 'java'
    id "com.google.protobuf" version "0.8.10"
}
apply plugin: 'com.github.johnrengelman.shadow'
   
   
repositories {
    mavenCentral()
    jcenter()
   
    maven {
        url 'https://packages.confluent.io/maven'
    }
}
   
group 'huxihx.kafkastreams'
   
sourceCompatibility = 1.8
targetCompatibility = '1.8'
version = '0.0.1'
   
dependencies {
    implementation 'com.google.protobuf:protobuf-java:3.11.4'
    implementation 'org.slf4j:slf4j-simple:1.7.26'
    implementation 'org.apache.kafka:kafka-streams:2.4.0'
   
    testCompile group: 'junit', name: 'junit', version: '4.12'
}
   
protobuf {
    generatedFilesBaseDir = "$projectDir/src/"
    protoc {
        artifact = 'com.google.protobuf:protoc:3.11.4'
    }
}
   
jar {
    manifest {
        attributes(
                'Class-Path': configurations.compile.collect { it.getName() }.join(' '),
                'Main-Class': 'huxihx.kafkastreams.TumblingWindow'
        )
    }
}
   
shadowJar {
    archiveName = "kstreams-tumbling-windows-standalone-${version}.${extension}"
}  

The project's main class is huxihx.kafkastreams.TumblingWindow.

Save the file, then run the following command to generate the Gradle wrapper:

$ gradle wrapper

After doing this, we create a sub-directory named configuration under the tumbling-windows directory to save our parameter configuration file dev.properties:

$ mkdir configuration
$ cd configuration
$ vi dev.properties

The contents of dev.properties are as follows:

application.id=tumbling-window-app

bootstrap.servers=localhost:9092

rating.topic.name=ratings
rating.topic.partitions=1
rating.topic.replication.factor=1

rating.count.topic.name=rating-counts
rating.count.topic.partitions=1
rating.count.topic.replication.factor=1

Here we define an input topic, ratings, and an output topic, rating-counts. The former carries movie rating events, and the latter stores the number of ratings each movie receives in each time window.

4. Create a message schema

Since we use ProtocolBuffer for serialization, we need to generate Java classes in advance to model entity messages. We execute the following command under the tumbling-windows path to create a folder to save the schema:

$ mkdir -p src/main/proto && cd src/main/proto

Then create a file named rating.proto in the proto folder with the following content:

syntax = "proto3";
   
package huxihx.kafkastreams.proto;
   
message Rating {
    string title = 1;
    int32 release_year = 2;
    double rating = 3;
    string timestamp = 4;
} 

Since the output format is very simple, we will not generate a separate schema class for the output topic this time. After saving the above file, run the gradlew command in the tumbling-windows directory:

./gradlew build

At this point, you should see the generated Java class RatingOuterClass under src/main/java/huxihx/kafkastreams/proto in the tumbling-windows directory.

5. Create Serdes

In this step we create the Serdes for the topic messages. First run the following command in the tumbling-windows directory to create the package directory:

$ mkdir -p src/main/java/huxihx/kafkastreams/serdes  

Create ProtobufSerializer.java in the newly created serdes folder, as follows:

package huxihx.kafkastreams.serdes;
    
import com.google.protobuf.MessageLite;
import org.apache.kafka.common.serialization.Serializer;
    
public class ProtobufSerializer<T extends MessageLite> implements Serializer<T> {
    @Override
    public byte[] serialize(String topic, T data) {
        return data == null ? new byte[0] : data.toByteArray();
    }
}  

Next is to create ProtobufDeserializer.java:

package huxihx.kafkastreams.serdes;
    
import com.google.protobuf.InvalidProtocolBufferException;
import com.google.protobuf.MessageLite;
import com.google.protobuf.Parser;
import org.apache.kafka.common.errors.SerializationException;
import org.apache.kafka.common.serialization.Deserializer;
    
import java.util.Map;
    
public class ProtobufDeserializer<T extends MessageLite> implements Deserializer<T> {
    
    private Parser<T> parser;
    
    @Override
    public void configure(Map<String, ?> configs, boolean isKey) {
        parser = (Parser<T>) configs.get("parser");
    }
    
    @Override
    public T deserialize(String topic, byte[] data) {
        try {
            return parser.parseFrom(data);
        } catch (InvalidProtocolBufferException e) {
            throw new SerializationException("Failed to deserialize from a protobuf byte array.", e);
        }
    }
}

Finally, ProtobufSerdes.java:

package huxihx.kafkastreams.serdes;
    
import com.google.protobuf.MessageLite;
import com.google.protobuf.Parser;
import org.apache.kafka.common.serialization.Deserializer;
import org.apache.kafka.common.serialization.Serde;
import org.apache.kafka.common.serialization.Serializer;
    
import java.util.HashMap;
import java.util.Map;
    
public class ProtobufSerdes<T extends MessageLite> implements Serde<T> {
    
    private final Serializer<T> serializer;
    private final Deserializer<T> deserializer;
    
    public ProtobufSerdes(Parser<T> parser) {
        serializer = new ProtobufSerializer<>();
        deserializer = new ProtobufDeserializer<>();
        Map<String, Parser<T>> config = new HashMap<>();
        config.put("parser", parser);
        deserializer.configure(config, false);
    }
    
    @Override
    public Serializer<T> serializer() {
        return serializer;
    }
    
    @Override
    public Deserializer<T> deserializer() {
        return deserializer;
    }
}

6. Main development process

For time windows, you must define a TimestampExtractor to tell Kafka Streams how to determine the time dimension. In this example, we need to create a TimestampExtractor to extract the time information in the event. We create a file named RatingTimestampExtractor.java under huxihx.kafkastreams:

package huxihx.kafkastreams;

import huxihx.kafkastreams.proto.RatingOuterClass;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.streams.processor.TimestampExtractor;

import java.text.ParseException;
import java.text.SimpleDateFormat;

public class RatingTimestampExtractor implements TimestampExtractor {

    @Override
    public long extract(ConsumerRecord<Object, Object> record, long partitionTime) {
        final SimpleDateFormat sdf = new SimpleDateFormat("yyyy-MM-dd'T'HH:mm:ss");
        String eventTime = ((RatingOuterClass.Rating)record.value()).getTimestamp();
        try {
            return sdf.parse(eventTime).getTime();
        } catch (ParseException e) {
            return 0;
        }
    }
}  

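The code above uses SimpleDateFormat to convert the time string into a timestamp in epoch milliseconds. Note that the pattern has no time-zone specifier, so the trailing offset in the event (-0700) is ignored and the time is interpreted in the JVM's default time zone; the window boundaries shown in this article correspond to a JVM running in UTC+8. If parsing fails, the extractor returns 0.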
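The effect of the date pattern on the extracted timestamp can be checked in isolation. The following is a small sketch using only the JDK; the class name TimestampParsingSketch and its parse helper are hypothetical, and the zone GMT+08:00 is an assumption chosen because it reproduces the window boundaries shown in this article. It compares the pattern used by the extractor with a variant that adds the Z specifier so that an offset such as -0700 is honored.

```java
import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.util.TimeZone;

public class TimestampParsingSketch {

    // Parses text with the given pattern. The formatter's zone only matters
    // when the pattern itself has no zone specifier.
    public static long parse(String pattern, String zoneId, String text) {
        SimpleDateFormat sdf = new SimpleDateFormat(pattern);
        sdf.setTimeZone(TimeZone.getTimeZone(zoneId));
        try {
            return sdf.parse(text).getTime();
        } catch (ParseException e) {
            throw new IllegalArgumentException("Cannot parse: " + text, e);
        }
    }

    public static void main(String[] args) {
        String event = "2019-04-25T18:00:00-0700";

        // Pattern from the extractor: "-0700" is ignored, 18:00 is read as GMT+08:00.
        System.out.println(parse("yyyy-MM-dd'T'HH:mm:ss", "GMT+08:00", event));  // 1556186400000

        // With the Z specifier the offset in the text wins over the formatter's zone.
        System.out.println(parse("yyyy-MM-dd'T'HH:mm:ssZ", "GMT+08:00", event)); // 1556240400000
    }
}
```

If you switch the extractor to the second pattern, the results no longer depend on where the application runs, but the window boundaries will differ from the ones printed in this article.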

Next, we write the main program: TumblingWindow.java:

package huxihx.kafkastreams;

import huxihx.kafkastreams.proto.RatingOuterClass;
import huxihx.kafkastreams.serdes.ProtobufSerdes;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.NewTopic;
import org.apache.kafka.clients.admin.TopicListing;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.KeyValue;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.Topology;
import org.apache.kafka.streams.kstream.Consumed;
import org.apache.kafka.streams.kstream.Grouped;
import org.apache.kafka.streams.kstream.TimeWindows;
import org.apache.kafka.streams.kstream.Windowed;

import java.io.FileInputStream;
import java.io.IOException;
import java.nio.file.Files;
import java.time.Duration;
import java.util.ArrayList;
import java.util.Collection;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Properties;
import java.util.concurrent.CountDownLatch;
import java.util.stream.Collectors;

public class TumblingWindow {

    public static void main(String[] args) throws Exception {
        if (args.length < 1) {
            throw new IllegalArgumentException("This program takes one argument: the path to an environment configuration file.");
        }

        new TumblingWindow().runRecipe(args[0]);
    }

    private Properties loadEnvProperties(String fileName) throws IOException {
        Properties envProps = new Properties();
        try (FileInputStream input = new FileInputStream(fileName)) {
            envProps.load(input);
        }
        return envProps;
    }

    private void runRecipe(final String configPath) throws Exception {
        Properties envProps = this.loadEnvProperties(configPath);
        Properties streamProps = this.createStreamsProperties(envProps);

        Topology topology = this.buildTopology(envProps);
        this.preCreateTopics(envProps);

        final KafkaStreams streams = new KafkaStreams(topology, streamProps);
        final CountDownLatch latch = new CountDownLatch(1);

        // Attach shutdown handler to catch Control-C.
        Runtime.getRuntime().addShutdownHook(new Thread("streams-shutdown-hook") {
            @Override
            public void run() {
                streams.close();
                latch.countDown();
            }
        });

        try {
            streams.start();
            latch.await();
        } catch (Throwable e) {
            System.exit(1);
        }
        System.exit(0);

    }


    private Topology buildTopology(final Properties envProps) {
        final StreamsBuilder builder = new StreamsBuilder();
        final String ratingTopic = envProps.getProperty("rating.topic.name");
        final String ratingCountTopic = envProps.getProperty("rating.count.topic.name");

        builder.stream(ratingTopic, Consumed.with(Serdes.String(), ratingProtobufSerdes()))
                .map((key, rating) -> new KeyValue<>(rating.getTitle(), rating))
                .groupByKey(Grouped.with(Serdes.String(), ratingProtobufSerdes()))
                .windowedBy(TimeWindows.of(Duration.ofMinutes(10)))
                .count()
                .toStream()
                .<String, String>map((Windowed<String> key, Long count) -> new KeyValue<>(windowedKeyToString(key), count.toString()))
                .to(ratingCountTopic);

        return builder.build();
    }


    private static void preCreateTopics(Properties envProps) throws Exception {
        Map<String, Object> config = new HashMap<>();
        config.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, envProps.getProperty("bootstrap.servers"));
        String inputTopic = envProps.getProperty("rating.topic.name");
        String outputTopic = envProps.getProperty("rating.count.topic.name");
        Map<String, String> topicConfigs = new HashMap<>();
        topicConfigs.put("retention.ms", Long.toString(Long.MAX_VALUE));


        try (AdminClient client = AdminClient.create(config)) {
            Collection<TopicListing> existingTopics = client.listTopics().listings().get();

            List<NewTopic> topics = new ArrayList<>();
            List<String> topicNames = existingTopics.stream().map(TopicListing::name).collect(Collectors.toList());
            if (!topicNames.contains(inputTopic))
                topics.add(new NewTopic(
                        inputTopic,
                        Integer.parseInt(envProps.getProperty("rating.topic.partitions")),
                        Short.parseShort(envProps.getProperty("rating.topic.replication.factor"))).configs(topicConfigs));

            if (!topicNames.contains(outputTopic))
                topics.add(new NewTopic(
                        outputTopic,
                        Integer.parseInt(envProps.getProperty("rating.count.topic.partitions")),
                        Short.parseShort(envProps.getProperty("rating.count.topic.replication.factor"))).configs(topicConfigs));

            if (!topics.isEmpty())
                client.createTopics(topics).all().get();
        }
    }

    private Properties createStreamsProperties(Properties envProps) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, envProps.getProperty("application.id"));
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, envProps.getProperty("bootstrap.servers"));
        props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        props.put(StreamsConfig.DEFAULT_TIMESTAMP_EXTRACTOR_CLASS_CONFIG, RatingTimestampExtractor.class.getName());
        props.put(StreamsConfig.CACHE_MAX_BYTES_BUFFERING_CONFIG, 0);
        try {
            props.put(StreamsConfig.STATE_DIR_CONFIG,
                    Files.createTempDirectory("tumbling-windows").toAbsolutePath().toString());
        } catch (IOException ignored) {
        }
        return props;
    }

    private String windowedKeyToString(Windowed<String> key) {
        return String.format("[%s@%s/%s]", key.key(), key.window().start(), key.window().end());
    }

    private static ProtobufSerdes<RatingOuterClass.Rating> ratingProtobufSerdes() {
        return new ProtobufSerdes<>(RatingOuterClass.Rating.parser());
    }
}

7. Write Test Producer

Now create src/main/java/huxihx/kafkastreams/tests/TestProducer.java for testing, with the following content:

package huxihx.kafkastreams.tests;

import huxihx.kafkastreams.proto.RatingOuterClass;
import huxihx.kafkastreams.serdes.ProtobufSerializer;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.Producer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;

import java.util.Arrays;
import java.util.List;
import java.util.Properties;

public class TestProducer {
    private static final List<RatingOuterClass.Rating> TEST_EVENTS = Arrays.asList(
            RatingOuterClass.Rating.newBuilder().setTitle("Die Hard").setReleaseYear(1998).setRating(8.2)
                    .setTimestamp("2019-04-25T18:00:00-0700").build(),
            RatingOuterClass.Rating.newBuilder().setTitle("Die Hard").setReleaseYear(1998).setRating(4.5)
                    .setTimestamp("2019-04-25T18:03:00-0700").build(),
            RatingOuterClass.Rating.newBuilder().setTitle("Die Hard").setReleaseYear(1998).setRating(5.1)
                    .setTimestamp("2019-04-25T18:04:00-0700").build(),
            RatingOuterClass.Rating.newBuilder().setTitle("Die Hard").setReleaseYear(1998).setRating(2.0)
                    .setTimestamp("2019-04-25T18:07:00-0700").build(),
            RatingOuterClass.Rating.newBuilder().setTitle("Die Hard").setReleaseYear(1998).setRating(8.3)
                    .setTimestamp("2019-04-25T18:32:00-0700").build(),
            RatingOuterClass.Rating.newBuilder().setTitle("Die Hard").setReleaseYear(1998).setRating(3.4)
                    .setTimestamp("2019-04-25T18:36:00-0700").build(),
            RatingOuterClass.Rating.newBuilder().setTitle("Die Hard").setReleaseYear(1998).setRating(4.2)
                    .setTimestamp("2019-04-25T18:43:00-0700").build(),
            RatingOuterClass.Rating.newBuilder().setTitle("Die Hard").setReleaseYear(1998).setRating(7.6)
                    .setTimestamp("2019-04-25T18:44:00-0700").build(),

            RatingOuterClass.Rating.newBuilder().setTitle("Tree of Life").setReleaseYear(2011).setRating(4.9)
                    .setTimestamp("2019-04-25T20:01:00-0700").build(),
            RatingOuterClass.Rating.newBuilder().setTitle("Tree of Life").setReleaseYear(2011).setRating(5.6)
                    .setTimestamp("2019-04-25T20:02:00-0700").build(),
            RatingOuterClass.Rating.newBuilder().setTitle("Tree of Life").setReleaseYear(2011).setRating(9.0)
                    .setTimestamp("2019-04-25T20:03:00-0700").build(),
            RatingOuterClass.Rating.newBuilder().setTitle("Tree of Life").setReleaseYear(2011).setRating(6.5)
                    .setTimestamp("2019-04-25T20:12:00-0700").build(),
            RatingOuterClass.Rating.newBuilder().setTitle("Tree of Life").setReleaseYear(2011).setRating(2.1)
                    .setTimestamp("2019-04-25T20:13:00-0700").build(),


            RatingOuterClass.Rating.newBuilder().setTitle("A Walk in the Clouds").setReleaseYear(1995).setRating(3.6)
                    .setTimestamp("2019-04-25T22:20:00-0700").build(),
            RatingOuterClass.Rating.newBuilder().setTitle("A Walk in the Clouds").setReleaseYear(1995).setRating(6.0)
                    .setTimestamp("2019-04-25T22:21:00-0700").build(),
            RatingOuterClass.Rating.newBuilder().setTitle("A Walk in the Clouds").setReleaseYear(1995).setRating(7.0)
                    .setTimestamp("2019-04-25T22:22:00-0700").build(),
            RatingOuterClass.Rating.newBuilder().setTitle("A Walk in the Clouds").setReleaseYear(1995).setRating(4.6)
                    .setTimestamp("2019-04-25T22:23:00-0700").build(),
            RatingOuterClass.Rating.newBuilder().setTitle("A Walk in the Clouds").setReleaseYear(1995).setRating(7.1)
                    .setTimestamp("2019-04-25T22:24:00-0700").build(),


            RatingOuterClass.Rating.newBuilder().setTitle("A Walk in the Clouds").setReleaseYear(1998).setRating(9.9)
                    .setTimestamp("2019-04-25T21:15:00-0700").build(),
            RatingOuterClass.Rating.newBuilder().setTitle("A Walk in the Clouds").setReleaseYear(1998).setRating(8.9)
                    .setTimestamp("2019-04-25T21:16:00-0700").build(),
            RatingOuterClass.Rating.newBuilder().setTitle("A Walk in the Clouds").setReleaseYear(1998).setRating(7.9)
                    .setTimestamp("2019-04-25T21:17:00-0700").build(),
            RatingOuterClass.Rating.newBuilder().setTitle("A Walk in the Clouds").setReleaseYear(1998).setRating(8.9)
                    .setTimestamp("2019-04-25T21:18:00-0700").build(),
            RatingOuterClass.Rating.newBuilder().setTitle("A Walk in the Clouds").setReleaseYear(1998).setRating(9.9)
                    .setTimestamp("2019-04-25T21:19:00-0700").build(),
            RatingOuterClass.Rating.newBuilder().setTitle("A Walk in the Clouds").setReleaseYear(1998).setRating(9.9)
                    .setTimestamp("2019-04-25T21:20:00-0700").build(),

            RatingOuterClass.Rating.newBuilder().setTitle("Super Mario Bros.").setReleaseYear(1993).setRating(3.5)
                    .setTimestamp("2019-04-25T13:00:00-0700").build(),
            RatingOuterClass.Rating.newBuilder().setTitle("Super Mario Bros.").setReleaseYear(1993).setRating(4.5)
                    .setTimestamp("2019-04-25T13:07:00-0700").build(),
            RatingOuterClass.Rating.newBuilder().setTitle("Super Mario Bros.").setReleaseYear(1993).setRating(5.5)
                    .setTimestamp("2019-04-25T13:30:00-0700").build(),
            RatingOuterClass.Rating.newBuilder().setTitle("Super Mario Bros.").setReleaseYear(1993).setRating(6.5)
                    .setTimestamp("2019-04-25T13:34:00-0700").build());

    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(ProducerConfig.ACKS_CONFIG, "all");
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, "org.apache.kafka.common.serialization.StringSerializer");
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, new ProtobufSerializer<RatingOuterClass.Rating>().getClass());

        try (final Producer<String, RatingOuterClass.Rating> producer = new KafkaProducer<>(props)) {
            TEST_EVENTS.stream().map(event ->
                    new ProducerRecord<String, RatingOuterClass.Rating>("ratings", event)).forEach(producer::send);
        }
    }
}

8. Test

First we run the following command to build the project:

$ ./gradlew shadowJar  

Then start the Kafka cluster and run the Kafka Streams application:

$ java -jar build/libs/kstreams-tumbling-windows-standalone-0.0.1.jar configuration/dev.properties

Now open another terminal and run the test producer:

$ java -cp build/libs/kstreams-tumbling-windows-standalone-0.0.1.jar huxihx.kafkastreams.tests.TestProducer

Then open another terminal and run a ConsoleConsumer command to verify the output:

$ bin/kafka-console-consumer.sh --bootstrap-server localhost:9092 --topic rating-counts --from-beginning --property print.key=true

If everything is normal, you should see the following output:

[Die Hard@1556186400000/1556187000000]	1
[Die Hard@1556186400000/1556187000000]	2
[Die Hard@1556186400000/1556187000000]	3
[Die Hard@1556186400000/1556187000000]	4
[Die Hard@1556188200000/1556188800000]	1
[Die Hard@1556188200000/1556188800000]	2
[Die Hard@1556188800000/1556189400000]	1
[Die Hard@1556188800000/1556189400000]	2
[Tree of Life@1556193600000/1556194200000]	1
[Tree of Life@1556193600000/1556194200000]	2
[Tree of Life@1556193600000/1556194200000]	3
[Tree of Life@1556194200000/1556194800000]	1
[Tree of Life@1556194200000/1556194800000]	2
[A Walk in the Clouds@1556202000000/1556202600000]	1
[A Walk in the Clouds@1556202000000/1556202600000]	2
[A Walk in the Clouds@1556202000000/1556202600000]	3
[A Walk in the Clouds@1556202000000/1556202600000]	4
[A Walk in the Clouds@1556202000000/1556202600000]	5
[A Walk in the Clouds@1556197800000/1556198400000]	1
[A Walk in the Clouds@1556197800000/1556198400000]	2
[A Walk in the Clouds@1556197800000/1556198400000]	3
[A Walk in the Clouds@1556197800000/1556198400000]	4
[A Walk in the Clouds@1556197800000/1556198400000]	5
[A Walk in the Clouds@1556198400000/1556199000000]	1
[Super Mario Bros.@1556168400000/1556169000000]	1
[Super Mario Bros.@1556168400000/1556169000000]	2
[Super Mario Bros.@1556170200000/1556170800000]	1
[Super Mario Bros.@1556170200000/1556170800000]	2

Inside the square brackets, the text before the @ is the movie title, the first timestamp is the window start time, and the second timestamp is the window end time, both in epoch milliseconds.
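Epoch milliseconds are hard to read at a glance, so a small helper can turn an output key back into its parts. The sketch below is only an illustration using the JDK; the class name WindowedKeyDecoder and its describe method are hypothetical and not part of the project.

```java
import java.time.Instant;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class WindowedKeyDecoder {

    // Matches "[title@start/end]"; the greedy title group tolerates an '@' inside a title.
    private static final Pattern KEY = Pattern.compile("\\[(.+)@(\\d+)/(\\d+)\\]");

    // Returns the title with the window bounds rendered as UTC instants.
    public static String describe(String windowedKey) {
        Matcher m = KEY.matcher(windowedKey);
        if (!m.matches()) {
            throw new IllegalArgumentException("Unexpected key format: " + windowedKey);
        }
        Instant start = Instant.ofEpochMilli(Long.parseLong(m.group(2)));
        Instant end = Instant.ofEpochMilli(Long.parseLong(m.group(3)));
        return m.group(1) + " from " + start + " to " + end;
    }

    public static void main(String[] args) {
        // prints: Die Hard from 2019-04-25T10:00:00Z to 2019-04-25T10:10:00Z
        System.out.println(describe("[Die Hard@1556186400000/1556187000000]"));
    }
}
```

The decoded bounds confirm that each window spans exactly 10 minutes, matching the Duration.ofMinutes(10) used in the topology.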

Origin www.cnblogs.com/huxi2b/p/12672981.html