Develop stream processing applications using Kafka Streams

1. Introduction

1.1 Definition of Kafka Streams

Kafka Streams is an open source, horizontally scalable stream processing library built on top of Apache Kafka. It allows you to build efficient stream processing applications while inheriting Kafka's high performance, scalability, and fault tolerance.

1.2 Advantages of Kafka Streams

Advantages of Kafka Streams include:

  • Because it is part of the Kafka ecosystem, it integrates easily into an existing Kafka environment.
  • It is easy to deploy and manage; applications are ordinary JVM processes, so deployment and operations can be automated with container technologies such as Docker.
  • For stream processing tasks it is lightweight and performs well, since it runs inside the application process and needs no separate processing cluster.

1.3 Kafka Streams Application Scenarios

Kafka Streams is mainly used in the following application scenarios:

  • Real-time data processing: Rapid analysis and processing of data through real-time streaming computing.
  • Streaming ETL: Move data from one system to another while applying transformation, cleaning, and aggregation operations along the way.
  • Stream-table joins: Join a stream against a table to enable real-time lookups and combined analysis.

2. Environment setup

2.1 Install Kafka

Download the Kafka binary package from the official website and extract it. For the installation process, please refer to the official documentation.

2.2 Install Kafka Streams

Kafka Streams is added as a dependency of your Maven or Gradle project. For Maven, add the following to pom.xml:

<dependency>
    <groupId>org.apache.kafka</groupId>
    <artifactId>kafka-streams</artifactId>
    <version>2.8.0</version>
</dependency>
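
For Gradle, the equivalent dependency in build.gradle is:

implementation 'org.apache.kafka:kafka-streams:2.8.0'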

2.3 Building a Kafka cluster

To build a Kafka cluster, you can use tools such as Docker Compose to implement automated deployment. For the test environment, a single machine can be used to build a multi-node Kafka cluster; for the production environment, the cluster size needs to be determined according to business requirements and QPS and other indicators.

3. Introduction to Kafka Streams Programming API

3.1 Kafka Streams main API

Kafka Streams is a Java API that allows users to transform and process streaming data using simple Java functions. Kafka Streams mainly includes the following APIs:

  • StreamsBuilder: Used to build the processing topology of a Kafka Streams application.
  • KStream and KTable: Represent the messages in Kafka topics as a stream of key-value records or as a changelog table (see the sketch below).
  • GlobalKTable: Similar to KTable, but holds the full contents of its topic (all partitions) on every application instance.
  • Serializer and Deserializer (Serdes): Used to serialize and deserialize record keys and values when writing to and reading from Kafka.
  • Processor and Transformer: Used to implement custom processing and transformation of streams.
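
As a quick illustration (a minimal sketch; the topic names are placeholders), the three stream and table abstractions are created from a StreamsBuilder as follows:

StreamsBuilder builder = new StreamsBuilder();

// A record stream: every message is an independent event
KStream<String, String> stream = builder.stream("events-topic");

// A changelog table: each key keeps only its latest value
KTable<String, String> table = builder.table("state-topic");

// A global table: fully replicated to every application instance
GlobalKTable<String, String> globalTable = builder.globalTable("reference-topic");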

3.2 Configuration and parameters of the application

In a Kafka Streams application, the following parameters can be used to configure the behavior of the application:

  • Bootstrap servers: Specifies the addresses used for the initial connection to the Kafka cluster.
  • Application ID: Each application must have a unique ID.
  • Serde Configuration: Used to specify how to serialize and deserialize record keys and record values.
  • Cache Size Control: Used to control the size of the application's local cache.
  • Processing guarantee: Specifies the semantics (for example at-least-once or exactly-once) used when consuming and producing data (see the sketch below).
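
A minimal configuration sketch that sets these parameters (all values are illustrative) might look like this:

Properties props = new Properties();
props.put(StreamsConfig.APPLICATION_ID_CONFIG, "my-app");                                // unique application ID
props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");                     // bootstrap servers
props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());     // key serde
props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());   // value serde
props.put(StreamsConfig.CACHE_MAX_BYTES_BUFFERING_CONFIG, 10 * 1024 * 1024L);            // local cache size
props.put(StreamsConfig.PROCESSING_GUARANTEE_CONFIG, StreamsConfig.EXACTLY_ONCE);        // processing semantics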

3.3 Definition and Construction of Topology

A topology is the logical representation of the data flow in a Kafka Streams application. It is a graph composed of Processors and State Stores: each Processor represents one data-flow operation, and each State Store holds local state for stateful operations. Topologies are built with the StreamsBuilder API.
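
For example (a minimal sketch), a topology can be built with a StreamsBuilder and inspected with describe(), which lists its source, processor, and sink nodes along with any state stores:

StreamsBuilder builder = new StreamsBuilder();
builder.stream("input-topic").to("output-topic");   // a trivial pass-through topology

Topology topology = builder.build();
System.out.println(topology.describe());            // prints the nodes and state stores of the topology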

3.4 Use of various data processing operations (map, filter, flatMap, etc.)

The Kafka Streams API provides various common data processing operations in order to process streaming data. The following are some basic data processing operations:

  • map: Transforms each record's key-value pair into a new key-value pair.
  • filter: Keeps only the records that satisfy a given predicate.
  • flatMap: Maps one key-value pair to zero or more key-value pairs.
  • groupByKey: Groups records by their key.
  • reduce: Combines the records of each group into a single aggregated value.
  • aggregate: Aggregates the records of each group into a result value, typically materialized in a state store.

The sample code is as follows:

// Define and build the topology
StreamsBuilder builder = new StreamsBuilder();
Serde<String> stringSerde = Serdes.String();
Serde<Long> longSerde = Serdes.Long();

KStream<String, String> textLines = builder.stream("TextLinesTopic");
KTable<String, Long> wordCounts = textLines
                            .flatMapValues(textLine -> Arrays.asList(textLine.split("\\W+")))
                            .groupBy((key, word) -> word)
                            .count();
wordCounts.toStream().to("WordsWithCountsTopic", Produced.with(stringSerde, longSerde));

// map: upper-case every value
KStream<String, String> upperCaseLines = textLines.map((key, value) -> KeyValue.pair(key, value.toUpperCase()));

// filter: keep only short lines
KStream<String, String> shortLines = textLines.filter((key, value) -> value.length() < 10);

// aggregate per key: word count materialized in a named state store
KTable<String, Long> wordCountTable = textLines.flatMapValues(textLine -> Arrays.asList(textLine.toLowerCase().split("\\W+")))
                            .groupBy((key, word) -> word)
                            .count(Materialized.as("wordCountStore"));

4. Practical cases of stream processing

4.1 Development steps of stream processing application

Step 1: Configure the application and create a StreamsBuilder

Properties props = new Properties();
props.put(StreamsConfig.APPLICATION_ID_CONFIG, "my-stream-processing-application");
props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
StreamsBuilder builder = new StreamsBuilder();
  • props.put(StreamsConfig.APPLICATION_ID_CONFIG, "my-stream-processing-application") specifies a unique identifier for the stream processing application.
  • props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092") specifies the bootstrap address of the Kafka cluster.
  • StreamsBuilder builder = new StreamsBuilder() creates the StreamsBuilder instance used to build the topology.

Step 2: Define input and output topics

final String inputTopic = "streams-input";
final String outputTopic = "streams-output";
KStream<String, String> inputStream = builder.stream(inputTopic);
KStream<String, String> outputStream = inputStream.mapValues(value -> value.toUpperCase());
outputStream.to(outputTopic);
  • builder.stream(inputTopic) reads messages from the topic named inputTopic and returns a KStream<String, String>.
  • inputStream.mapValues(value -> value.toUpperCase()) processes each message in inputStream and produces the transformed outputStream.
  • outputStream.to(outputTopic) writes all messages from outputStream to the topic named outputTopic.

Step 3: Build and start the Kafka Streams instance

KafkaStreams streams = new KafkaStreams(builder.build(), props);
streams.start();
  • new KafkaStreams(builder.build(), props) builds the topology defined above into a Kafka Streams instance.
  • streams.start() starts the instance and begins processing messages.
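
It is also good practice to close the instance cleanly when the application shuts down, for example:

// Close the Kafka Streams instance (and release its state stores) on JVM shutdown
Runtime.getRuntime().addShutdownHook(new Thread(streams::close));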

4.2 Event log monitoring case

Scenario description:

Suppose our backend service sends HTTP request log entries to a Kafka topic every minute as JSON objects in the following format:

{
    "timestamp": "2019-01-02T13:54:34.123Z",
    "method": "GET",
    "endpoint": "http://localhost:8080/api/v1/users",
    "status_code": 200,
    "response_time": 23.4
}

Now we need to visualize user request logs in real time, with updates in the following format:

{
    "time": "2019-01-02 14:30:22",
    "users": [
        { "Country": "CA", "Count": 60 },
        { "Country": "US", "Count": 38 },
        { "Country": "CN", "Count": 6 }
    ]
}

Solution:

Build a stream processing application with Kafka Streams that preprocesses the request log entries, aggregates and transforms them as needed (for example, grouping and counting by country), and writes the results to a Kafka output topic. Whenever the application writes a new output record, it can be read from the output topic and consumed by any available visualization tool.
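
As an illustrative sketch only: assuming each log record has been enriched with a country field (for example via a geo-IP lookup that is not shown), and with extractCountry(...) as a hypothetical helper that parses that field out of the JSON value, the aggregation could look like this:

KStream<String, String> requestLogs = builder.stream("request-logs");    // input topic name is illustrative

KTable<Windowed<String>, Long> requestsByCountry = requestLogs
        .groupBy((key, value) -> extractCountry(value))                  // re-key each log entry by country
        .windowedBy(TimeWindows.of(Duration.ofMinutes(1)))               // aggregate in one-minute windows
        .count();

requestsByCountry.toStream((windowedCountry, count) -> windowedCountry.key())
        .to("request-counts-by-country", Produced.with(Serdes.String(), Serdes.Long()));  // output topic name is illustrative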

4.3 Cases of User Behavior Statistics

Scenario description:

Suppose we are using Kafka topics to collect user events from a mobile application. Each event records three main attributes: the timestamp at which the event occurred, the hour of day that timestamp corresponds to, and the user type.

{
    "timestamp": 1517791088000,
    "hour_of_day": 7,
    "user_type": "bronze"
}

Now we need to aggregate these events in real time to understand user behavior, such as the total number of user events per hour and the number of events for each user tier (bronze, silver, and so on).

Solution:

Build a stream processing application with Kafka Streams that reads events from the source topic and writes the aggregated results to target topics.

KStream<String, String> inputStream = builder.stream("user_events");

// Hourly user counts: re-key each event by its hour bucket, then count per one-hour window.
// parseTimestamp is a helper (not shown) that parses the event timestamp out of the JSON value.
KTable<Windowed<String>, Long> hourlyUserCounts = inputStream
        .map((key, value) -> new KeyValue<>(parseTimestamp(value).toString("yyyyMMddHH"), value))
        .groupByKey()
        .windowedBy(TimeWindows.of(Duration.ofHours(1)))
        .count();

// User counts by type: count per key, then re-group by the user-type prefix of the key
// (this assumes record keys of the form "userType:userId") and sum the counts.
KTable<String, Long> userCountsByType = inputStream
        .groupByKey()
        .count()
        .groupBy((key, value) -> KeyValue.pair(key.split(":")[0], value), Grouped.with(Serdes.String(), Serdes.Long()))
        .reduce((v1, v2) -> v1 + v2, (v1, v2) -> v1 - v2);   // adder and subtractor

hourlyUserCounts.toStream((windowedKey, value) -> windowedKey.key())
        .to("hourly_user_counts", Produced.with(Serdes.String(), Serdes.Long()));
userCountsByType.toStream().to("user_counts_by_type", Produced.with(Serdes.String(), Serdes.Long()));

The sample code above re-keys the received user events (that is, the messages in the user_events topic) by an hourly yyyyMMddHH bucket and counts them per time window. It then counts users per type in a similar fashion and writes both results out to two other topics.

5. Performance optimization

5.1 How to evaluate the performance of Kafka Streams application

Evaluating the performance of Kafka Streams applications requires attention to the following aspects:

5.1.1 Throughput

Throughput refers to the number of messages processed by the Kafka Streams application per unit time. Throughput can be evaluated by the following metrics:

  • Input rate: The number of messages sent by the Kafka cluster to the Kafka Streams application per second.
  • Processing Latency: The time required from when a message arrives at the Kafka Streams application to when processing is complete.
  • Processing rate: The number of messages processed by the Kafka Streams application per second.

5.1.2 Latency

Latency refers to the time it takes for a Kafka Streams application to process a message. Latency can be assessed by the following metrics:

  • Maximum Latency: The maximum time it takes for a Kafka Streams application to process a message.
  • Average Latency: The average time it takes for a Kafka Streams application to process a message.

5.1.3 Memory usage

Memory usage refers to the amount of memory used by the Kafka Streams application. Memory usage can be evaluated by the following metrics:

  • Heap memory usage: The proportion of the Java heap space that has been used.
  • Non-heap memory usage: the proportion of Java non-heap space used.
  • GC time: The time required for Java garbage collection.
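
Many of these figures can be read from the metrics that a running KafkaStreams instance exposes (and that JMX or monitoring agents can also scrape); a minimal sketch:

// Print every metric exposed by a running KafkaStreams instance,
// including processing rate, latency, and commit statistics per thread and task.
Map<MetricName, ? extends Metric> metrics = streams.metrics();
metrics.forEach((name, metric) ->
        System.out.println(name.group() + " / " + name.name() + " = " + metric.metricValue()));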

5.2 Optimizing parallelism and throughput

In order to improve the parallelism and throughput of Kafka Streams applications, the following optimization methods can be used:

5.2.1 Adjust the thread pool size

A Kafka Streams application processes messages with a pool of threads, and parallelism and throughput can be improved by increasing the size of that pool. As a generic Java illustration of a fixed-size thread pool:

// Create a thread pool with 10 threads
ExecutorService executorService = Executors.newFixedThreadPool(10);

// Submit tasks to the thread pool
for (int i = 0; i < 1000; i++) {
    executorService.submit(() -> {
        // message-processing logic
    });
}
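
Inside Kafka Streams itself, however, the number of processing threads is controlled by the num.stream.threads configuration rather than a hand-rolled executor; a minimal sketch:

Properties props = new Properties();
// Run 4 stream threads in this instance; each thread handles a share of the input partitions
props.put(StreamsConfig.NUM_STREAM_THREADS_CONFIG, 4);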

5.2.2 Adjust the number of partitions

Dividing a topic into multiple partitions can improve the parallelism and throughput of Kafka Streams applications. The number of partitions can be adjusted by the following command:

bin/kafka-topics.sh --zookeeper localhost:2181 --alter --topic mytopic --partitions 10

5.2.3 Using compression algorithms

Using a compression algorithm reduces the amount of data a Kafka Streams application transfers over the network, which can improve throughput. Compression is configured on the producer that Kafka Streams uses internally:

// Create the Streams configuration object
Properties streamsConfig = new Properties();
// Configure the embedded producer to compress outgoing records with gzip
streamsConfig.put(StreamsConfig.producerPrefix(ProducerConfig.COMPRESSION_TYPE_CONFIG), "gzip");

5.3 Implement data compression

To implement data compression in Kafka Streams applications, messages can be compressed and decompressed using the Gzip compression algorithm:

import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.util.Base64;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

public class GzipUtils {

    public static String compress(String str) {
        try {
            if (str == null || str.length() == 0) {
                return str;
            } else {
                ByteArrayOutputStream out = new ByteArrayOutputStream();
                GZIPOutputStream gzip = new GZIPOutputStream(out);
                gzip.write(str.getBytes("UTF-8"));
                gzip.close();
                byte[] compressed = out.toByteArray();
                out.close();
                return Base64.getEncoder().encodeToString(compressed);
            }
        } catch (IOException e) {
            throw new RuntimeException(e);
        }
    }

    public static String uncompress(String str) {
        try {
            if (str == null || str.length() == 0) {
                return str;
            } else {
                byte[] compressed = Base64.getDecoder().decode(str);
                ByteArrayInputStream in = new ByteArrayInputStream(compressed);
                GZIPInputStream gzip = new GZIPInputStream(in);
                ByteArrayOutputStream out = new ByteArrayOutputStream();
                byte[] buffer = new byte[4096];
                int bytesRead;
                while ((bytesRead = gzip.read(buffer)) > 0) {
                    out.write(buffer, 0, bytesRead);
                }
                gzip.close();
                in.close();
                out.close();
                return new String(out.toByteArray(), "UTF-8");
            }
        } catch (IOException e) {
            throw new RuntimeException(e);
        }
    }

}

Example usage:

// Compress a string
String compressedStr = GzipUtils.compress("hello world");
// Decompress the string
String uncompressedStr = GzipUtils.uncompress(compressedStr);

Note: The code above implements compression and decompression with the Gzip algorithm. It uses java.util.zip.GZIPOutputStream to compress messages, java.util.zip.GZIPInputStream to decompress them, and java.util.Base64 to encode and decode the compressed byte arrays.

6. Application in production

Kafka Streams is a stream processing library that makes it easy to process real-time data in a distributed way. When running Kafka Streams in production, pay attention to the following aspects.

6.1 High availability cluster deployment

To ensure high availability of Kafka Streams in production, we need to deploy it as a highly available cluster. This means running multiple instances of the Kafka Streams application, distributed across multiple physical or virtual machines, to avoid a single point of failure.

The following is an example of a Java-based Kafka Streams high-availability cluster deployment:

    Properties properties = new Properties();
    properties.setProperty(StreamsConfig.APPLICATION_ID_CONFIG, "my-kafka-streams-app");
    properties.setProperty(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
    properties.setProperty(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass().getName());
    properties.setProperty(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass().getName());

    KafkaStreams streams = new KafkaStreams(topology, properties);
    streams.start();
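
All instances share the same application.id, so Kafka automatically balances the input partitions among them. Optionally, standby replicas can keep copies of local state on other instances so that a failed task can be taken over quickly; a minimal sketch:

    // Keep one standby copy of each state store on another instance for faster failover
    properties.setProperty(StreamsConfig.NUM_STANDBY_REPLICAS_CONFIG, "1");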

6.2 Monitoring and alarming

In a production environment, when the Kafka Streams application fails or behaves abnormally, we need to be notified promptly and take corresponding measures. Monitoring Kafka Streams is therefore very important.

For example, we can use the StreamsConfig.STATE_DIR_CONFIG property provided by Kafka Streams to control where application state is stored on the local file system, so that it can be restored after errors. In addition, open source monitoring tools such as Prometheus and Grafana can be used to monitor the health of the Kafka Streams application and send alerts.

The following is an example of Java-based Kafka Streams monitoring and alerting:

    Properties properties = new Properties();
    properties.setProperty(StreamsConfig.APPLICATION_ID_CONFIG, "my-kafka-streams-app");
    properties.setProperty(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
    properties.setProperty(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass().getName());
    properties.setProperty(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass().getName());
    properties.setProperty(StreamsConfig.STATE_DIR_CONFIG, "/tmp/kafka-streams");

    KafkaStreams streams = new KafkaStreams(topology, properties);
    streams.start();

    // Monitor with Prometheus and Grafana and send alerts (MonitoringInterceptorUtils is an illustrative helper)
    MonitoringInterceptorUtils monitoringInterceptorUtils = new MonitoringInterceptorUtils();
    monitoringInterceptorUtils.register(streams);

6.3 Log Management

In the production environment, we need to manage the logs of the Kafka Streams application. If logs are not handled carefully, they can hurt performance and make troubleshooting difficult.

To manage the logs of a Kafka Streams application, we can write them to files or to log collection systems such as ELK or Graylog for easier analysis and debugging.

The following is an example of Java-based Kafka Streams log management:

    Properties properties = new Properties();
    properties.setProperty(StreamsConfig.APPLICATION_ID_CONFIG, "my-kafka-streams-app");
    properties.setProperty(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
    properties.setProperty(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass().getName());
    properties.setProperty(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass().getName());
    properties.setProperty(StreamsConfig.STATE_DIR_CONFIG, "/tmp/kafka-streams");

    KafkaStreams streams = new KafkaStreams(topology, properties);
    streams.start();

    // Write logs to a file (Log4j 2 programmatic configuration)
    Appender fileAppender = RollingFileAppender.newBuilder()
            .setName("fileLogger")
            .setFileName("/tmp/kafka-streams.log")
            .build();
    fileAppender.start();

    LoggerContext context = (LoggerContext) LogManager.getContext(false);
    Configuration config = context.getConfiguration();
    config.addAppender(fileAppender);
    AppenderRef ref = AppenderRef.createAppenderRef("fileLogger", null, null);
    AppenderRef[] refs = new AppenderRef[] { ref };
    LoggerConfig loggerConfig = LoggerConfig.createLogger(false, Level.INFO, "my.kafkastreams", "true", refs, null, config, null);
    loggerConfig.addAppender(fileAppender, null, null);
    config.addLogger("my.kafkastreams", loggerConfig);
    context.updateLoggers();
