Apache Kafka Getting Started Tutorial

1. Introduction


Apache Kafka is an open-source distributed stream processing platform developed by the Apache Software Foundation for handling large-scale real-time data streams, such as sensor data, website logs, and internal application messages. It can ingest, store, and process very high volumes of messages with low latency. In Kafka, producers are responsible for sending messages to Brokers in the Kafka cluster, and consumers subscribe to and receive messages from those Brokers.

Architecture

Kafka's architecture consists of three main roles: Producer, Broker, and Consumer, and is characterized by high concurrency, high throughput, and distribution. Producers send messages to Brokers, Consumers subscribe to and receive messages from Brokers, and a Broker can host multiple Topics. A Topic can have multiple Partitions; messages within a Partition are addressed by their Offset, and Kafka stores messages in an append-only log.

2. Kafka installation and configuration

Install the JDK

  1. Download the JDK, for example: jdk-8u291-linux-x64.tar.gz.
  2. Unzip the JDK to any directory, such as /usr/lib/jvm/jdk1.8.0_291.
  3. Configure environment variables, for example:
    $ export JAVA_HOME=/usr/lib/jvm/jdk1.8.0_291
    $ export JRE_HOME=$JAVA_HOME/jre
    $ export CLASSPATH=.:$JAVA_HOME/lib:$JRE_HOME/lib:$CLASSPATH
    $ export PATH=$JAVA_HOME/bin:$JRE_HOME/bin:$PATH

Install Kafka

  1. Download Kafka, for example: kafka_2.12-2.8.0.tgz.
  2. Unzip Kafka to any directory, such as /opt/kafka.
  3. Modify the configuration file and modify the server.properties file as needed.

Configuration file details

Kafka's configuration file is located at config/server.properties. The following are some commonly used configuration items and their meanings (a sample fragment follows this list):

  1. broker.id, the unique identifier of the broker.
  2. advertised.listeners, the address and port that the broker advertises to clients for connections.
  3. log.dirs, message storage file directory.
  4. zookeeper.connect, the ZooKeeper address and port to use.
  5. num.network.threads, the number of threads used to handle network requests.
  6. num.io.threads, the number of threads used to handle disk IO.
  7. socket.receive.buffer.bytes and socket.send.buffer.bytes are used to control the TCP buffer size.
  8. group.initial.rebalance.delay.ms, how long the group coordinator waits before performing the first rebalance of a new Consumer Group, giving more consumers time to join before partitions are assigned.
  9. auto.offset.reset, which offset a Consumer starts from when it has no committed offset or its committed offset no longer exists (earliest, latest, or none); the default is latest. Note that this is a consumer-side setting rather than a broker setting.
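
For reference, a minimal single-node server.properties fragment might look like the following; the broker ID, directories, and addresses are illustrative and should be adapted to your environment, and most other values shown are the shipped defaults:

broker.id=0
listeners=PLAINTEXT://0.0.0.0:9092
advertised.listeners=PLAINTEXT://192.168.1.10:9092
log.dirs=/opt/kafka/logs
num.network.threads=3
num.io.threads=8
socket.send.buffer.bytes=102400
socket.receive.buffer.bytes=102400
group.initial.rebalance.delay.ms=0
zookeeper.connect=localhost:2181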

3. Basic operations of Kafka

Startup and shutdown

# Start ZooKeeper first (Kafka 2.8 in ZooKeeper mode; skip if ZooKeeper is already running)
$KAFKA_HOME/bin/zookeeper-server-start.sh -daemon $KAFKA_HOME/config/zookeeper.properties

# Start Kafka
$KAFKA_HOME/bin/kafka-server-start.sh $KAFKA_HOME/config/server.properties

# Stop Kafka
$KAFKA_HOME/bin/kafka-server-stop.sh

Topic creation and deletion

// Note: this uses the old ZooKeeper-based AdminUtils API, which has been deprecated
// and removed in newer Kafka versions; prefer the AdminClient example shown below.
import kafka.admin.AdminUtils;
import kafka.utils.ZkUtils;
import java.util.Properties;

// Create a Topic (zkUtils is an initialized ZkUtils instance connected to ZooKeeper)
String topicName = "test";
int numPartitions = 3;
int replicationFactor = 2;
Properties topicConfig = new Properties();
AdminUtils.createTopic(zkUtils, topicName, numPartitions, replicationFactor, topicConfig);

// Delete a Topic
AdminUtils.deleteTopic(zkUtils, topicName);
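
On Kafka 2.x and later, the same operations are normally performed with the Java AdminClient rather than through ZooKeeper. Below is a minimal sketch, assuming a broker reachable at localhost:9092:

import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.NewTopic;
import java.util.Collections;
import java.util.Properties;

public class TopicAdminExample {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");

        try (AdminClient admin = AdminClient.create(props)) {
            // Create a topic with 3 partitions and a replication factor of 2
            NewTopic topic = new NewTopic("test", 3, (short) 2);
            admin.createTopics(Collections.singletonList(topic)).all().get();

            // Delete the topic (requires delete.topic.enable=true on the brokers)
            admin.deleteTopics(Collections.singletonList("test")).all().get();
        }
    }
}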

Partitions and Replication configuration

You can specify the number of Partitions and the Replication Factor when creating a Topic. The partition count can later be increased (it can only be increased, never decreased) with the following command:

# Increase the number of Partitions
bin/kafka-topics.sh --zookeeper localhost:2181 --alter --topic test --partitions 4

The Replication Factor, however, cannot be changed with kafka-topics.sh --alter; to change it, generate a new replica assignment and apply it with bin/kafka-reassign-partitions.sh.

How to use Producer and Consumer

Producer

import org.apache.kafka.clients.producer.*;
import java.util.Properties;

Properties props = new Properties();
props.put("bootstrap.servers", "localhost:9092");
props.put("acks", "all");
props.put("retries", 0);
props.put("batch.size", 16384);
props.put("linger.ms", 1);
props.put("buffer.memory", 33554432);
props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

Producer<String, String> producer = new KafkaProducer<>(props);
for (int i = 0; i < 100; i++)
    producer.send(new ProducerRecord<String, String>("test", Integer.toString(i), Integer.toString(i)));

producer.close();

Consumer

import org.apache.kafka.clients.consumer.*;
import java.time.Duration;
import java.util.Collections;
import java.util.Properties;

Properties props = new Properties();
props.put("bootstrap.servers", "localhost:9092");
props.put("group.id", "test");
props.put("auto.commit.interval.ms", "1000");
props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

Consumer<String, String> consumer = new KafkaConsumer<>(props);
consumer.subscribe(Collections.singletonList("test"));
while (true) {
    ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(100));
    for (ConsumerRecord<String, String> record : records)
        System.out.printf("offset = %d, key = %s, value = %s%n", record.offset(), record.key(), record.value());
}

4. Advanced application of Kafka

Message reliability guarantee

Message reliability in Kafka is achieved mainly through two mechanisms: the replica mechanism and the ISR (In-Sync Replicas) list.

  1. Replica mechanism
    The replica mechanism means that a partition (Partition) under a topic (Topic) can have multiple replicas, each storing a complete copy of the partition's messages; one replica is designated as the leader and the others are followers. Producers send messages only to the leader replica, which replicates them to the follower replicas, thus ensuring message durability. Even if a follower replica fails, message consumption is not affected, because the remaining replicas still hold the complete data.

  2. ISR (In-Sync Replicas) list
    The ISR list is the set of follower replicas that are currently in sync with the leader replica. When a follower falls too far behind the leader, it is removed from the ISR; once it catches up again, it is re-added. This mechanism underpins the high availability of the Kafka cluster and the reliability of messages.

  3. At-least-once semantics
    By default Kafka provides at-least-once semantics, meaning every message is processed at least once. This may lead to duplicate processing of messages, which costs some efficiency. If a message must be processed exactly once, you can enable idempotence (Idempotence) or use the transaction mechanism, as sketched below.
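
As an illustration, idempotence and transactions are enabled on the producer side with a few configuration changes. The following is a minimal sketch; the topic name and transactional.id are placeholders:

import org.apache.kafka.clients.producer.*;
import java.util.Properties;

Properties props = new Properties();
props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, "org.apache.kafka.common.serialization.StringSerializer");
props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, "org.apache.kafka.common.serialization.StringSerializer");
// Idempotence: the broker de-duplicates retried sends from this producer
props.put(ProducerConfig.ENABLE_IDEMPOTENCE_CONFIG, "true");
// Transactions: group several sends into one atomic unit
props.put(ProducerConfig.TRANSACTIONAL_ID_CONFIG, "demo-tx-1");

Producer<String, String> producer = new KafkaProducer<>(props);
producer.initTransactions();
try {
    producer.beginTransaction();
    producer.send(new ProducerRecord<>("test", "key", "value"));
    producer.commitTransaction();
} catch (Exception e) {
    // In a real application, distinguish fatal producer errors from retriable ones
    producer.abortTransaction();
    throw e;
} finally {
    producer.close();
}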

Kafka Streams

Kafka Streams is a stream processing library in the Kafka ecosystem. It builds on Kafka's strengths, such as high throughput, good scalability, and high reliability, supports real-time stream processing as well as batch-style processing, and offers a rich set of operators.

  1. Stream processing model
    The stream processing model transforms input data streams into output data streams and enables real-time data processing. In Kafka Streams, a data stream consists of records (Record), each made up of a key (Key) and a value (Value). Once familiar with this model, you can quickly develop efficient and reliable stream processing applications.

  2. Operators
    Operators are the core concept in Kafka Streams and the basic unit for transforming data streams. Kafka Streams provides a rich set of operators, including filters, mappers, aggregators, and groupers, which developers can combine as needed. Mappers and aggregators are the most commonly used, covering most per-record transformations and aggregations on data streams (see the short sketch after this list).
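
For example, a few of these operators chained together might look like the following sketch, where the topic names "orders" and "big-orders" are purely illustrative:

import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.kstream.KStream;

StreamsBuilder builder = new StreamsBuilder();
KStream<String, String> orders = builder.stream("orders");

orders
    // filter: keep only records whose value is longer than 10 characters
    .filter((key, value) -> value != null && value.length() > 10)
    // mapValues: transform each value
    .mapValues(value -> value.toUpperCase())
    // write the result to another topic
    .to("big-orders");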

Kafka Connect

Kafka Connect is a tool in the Kafka ecosystem for moving data into and out of Kafka. Data transfer is implemented through Connectors, and Kafka Connect can integrate a wide range of data sources and destinations, such as files, databases, and message queues. With Kafka Connect you can quickly set up data import and export, along with effective data management and monitoring.

  1. Connector quick start
    Using Kafka Connect is straightforward: write a Connector configuration file and start the Kafka Connect process. The configuration file specifies the data source or destination and defines how data is read from the source or written to the destination (see the sample configuration after this list).

  2. Implementing a custom Connector
    If the built-in Connectors of Kafka Connect do not meet your requirements, you can implement a custom Connector for importing or exporting data. The Connectors shipped with the Kafka Connect source code are a good reference, and you can extend their functionality as needed. A custom Connector lets you tailor data integration to your own business needs.
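
As a concrete quick-start example, the file source connector that ships with Kafka can be configured roughly as follows (this mirrors the config/connect-file-source.properties sample in the Kafka distribution; the file path and topic name are placeholders) and then run in standalone mode with bin/connect-standalone.sh config/connect-standalone.properties connect-file-source.properties:

name=local-file-source
connector.class=FileStreamSource
tasks.max=1
file=/tmp/test.txt
topic=connect-test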

5. Kafka cluster management

Deployment in a cluster environment

To deploy a Kafka cluster, follow these steps:

  1. Make sure that the operating systems of all nodes in the cluster are consistent, and CentOS 7 is recommended.
  2. Download and configure JDK, Kafka depends on the Java runtime environment.
  3. Download the Kafka installation package and decompress it to the specified directory.
  4. Modify the Kafka configuration file server.properties. The configuration items that need attention include the following (a per-node example follows this list):
    • broker.id: Indicates the ID of the current node, which must be unique among all nodes.
    • listeners: Used to set the Kafka binding address and port, where the port number needs to be unique on each node. It is recommended to use an IP address rather than a hostname as the listening address.
    • log.dirs: Indicates the path where the message log is saved. It is recommended to set it separately for each node to avoid data confusion caused by multiple nodes sharing a directory.
    • zookeeper.connect: Indicates the connection address of ZooKeeper, which is an important component of the Kafka cluster.
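
For instance, on a hypothetical three-node cluster, the broker-specific part of server.properties on the first node might look like this; IP addresses, ports, and paths are illustrative, and broker.id and advertised.listeners must differ on each node:

broker.id=1
listeners=PLAINTEXT://0.0.0.0:9092
advertised.listeners=PLAINTEXT://192.168.1.11:9092
log.dirs=/data/kafka-logs
zookeeper.connect=192.168.1.11:2181,192.168.1.12:2181,192.168.1.13:2181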

Operate and maintain the cluster

The operation and maintenance of Kafka cluster mainly includes the following aspects:

Monitoring and Alerting

A Kafka cluster should have a complete monitoring and alerting setup so that anomalies can be detected and handled promptly, preventing problems such as cluster downtime or data loss. Open source monitoring stacks such as Prometheus and Grafana are commonly used, typically fed by Kafka's JMX metrics.

Message backup and restore

To prevent message loss, the Kafka cluster needs an appropriate backup strategy so that messages remain available in the event of a system or data center failure. In practice this means relying on the multi-replica mechanism, cross-datacenter replication (for example with MirrorMaker), or dedicated data backup tools.

Handling hotspot issues

If a consumption hotspot appears in the cluster, it should be investigated promptly. You can check consumer lag with the kafka-consumer-groups.sh tool that ships with Kafka (for example, kafka-consumer-groups.sh --bootstrap-server localhost:9092 --describe --group <group>) or with third-party tools, locate the cause of the hotspot, and work out a corresponding solution.

Cluster expansion and contraction

When the Kafka cluster cannot meet business needs or needs to optimize performance, we may need to expand or shrink the cluster.

Expansion operation

Capacity expansion can be done by increasing the number of nodes and adjusting multiple configuration items:

  1. Increase the number of nodes: new nodes need the same environment configuration as the other nodes in the cluster, including operating system and Java version. After adding a node, configure its server.properties file (in particular a unique broker.id) and start the Kafka process on it. Existing partitions then need to be reassigned to the new node (for example with bin/kafka-reassign-partitions.sh), which migrates the data.
  2. Adjust configuration items: you can improve the performance of the Kafka cluster by tuning the throughput of message production and consumption, expanding Broker resources, and increasing the number of replicas.

Shrink operation

Scaling in can be done by reducing the number of nodes and adjusting configuration items:

  1. Reduce the number of nodes: first confirm whether there are redundant nodes; if so, they can be shut down or removed from the cluster. Before removing a node, its partitions must be reassigned to the remaining brokers (again with bin/kafka-reassign-partitions.sh), which migrates the data; only then should the broker be stopped and removed.
  2. Adjust configuration items: you can reduce the footprint of the Kafka cluster by shortening the message retention time and lowering the load handled by each Broker.

Before expanding and shrinking, it is necessary to understand the current status and performance of the cluster through appropriate monitoring tools, and configure and adjust according to actual needs. Also use a backup strategy to ensure data integrity and availability.

6. Application cases

Log collection

As a distributed message queue, Kafka enables efficient, reliable, and low-latency log collection. Here is a simple Java example that sends application or system logs to a Kafka cluster (a usage sketch follows the class):

import org.apache.kafka.clients.producer.*;
import java.util.Properties;

public class KafkaLogProducer {

    private final KafkaProducer<String, String> producer;
    private final String topic;

    public KafkaLogProducer(String brokers, String topic) {
        Properties prop = new Properties();

        // Kafka cluster address
        prop.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, brokers);

        // Key and value serializers
        prop.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, "org.apache.kafka.common.serialization.StringSerializer");
        prop.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, "org.apache.kafka.common.serialization.StringSerializer");

        this.producer = new KafkaProducer<>(prop);
        this.topic = topic;
    }

    public void sendLog(String message) {
        producer.send(new ProducerRecord<>(topic, message));
    }

    public void close() {
        producer.close();
    }
}
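
A hypothetical usage of this class, with the broker address and topic name as placeholders:

KafkaLogProducer logProducer = new KafkaLogProducer("localhost:9092", "app-logs");
logProducer.sendLog("2023-06-01 12:00:00 INFO Application started");
logProducer.close();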

Data synchronization

In addition to log collection, Kafka can also be used for data synchronization. With Kafka, data can be replicated from one system to another asynchronously and in batches. The following is a simple Java example that consumes records from a source topic, synchronizes them to a target system, and forwards them to a target topic:

import org.apache.kafka.clients.consumer.*;
import org.apache.kafka.clients.producer.*;
import java.sql.*;
import java.time.Duration;
import java.util.Arrays;
import java.util.Properties;

public class KafkaDataSync {

    private final KafkaConsumer<String, String> consumer;
    private final KafkaProducer<String, String> producer;
    private final String sourceTopic;
    private final String targetTopic;

    public KafkaDataSync(String brokers, String sourceTopic, String targetTopic) {
        Properties prop = new Properties();

        // Kafka cluster address
        prop.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, brokers);

        // Key and value serializers
        prop.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, "org.apache.kafka.common.serialization.StringSerializer");
        prop.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, "org.apache.kafka.common.serialization.StringSerializer");

        this.producer = new KafkaProducer<>(prop);

        // Consumer group configuration
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, brokers);
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "group1");
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, "org.apache.kafka.common.serialization.StringDeserializer");
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, "org.apache.kafka.common.serialization.StringDeserializer");

        consumer = new KafkaConsumer<>(props);
        consumer.subscribe(Arrays.asList(sourceTopic));
        this.sourceTopic = sourceTopic;
        this.targetTopic = targetTopic;
    }

    public void start() throws SQLException {
        while (true) {
            ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(100));
            for (ConsumerRecord<String, String> record : records) {
                String message = record.value();
                // Parse the record and synchronize it to the target system
                syncData(message);
            }
        }
    }

    public void close() {
        consumer.close();
        producer.close();
    }

    private void syncData(String message) {
        // Data synchronization logic (e.g. write to the target database)
        // ...
        // Forward the synchronized data to the target Kafka topic
        producer.send(new ProducerRecord<>(targetTopic, message));
    }

}

Real-time processing

As a distributed streaming platform, Kafka has strong real-time processing capabilities and integrates with a variety of real-time computing frameworks and processing engines, such as Apache Storm, Apache Flink, and Apache Spark. The following is a simple Kafka Streams example that counts log records per key within fixed time windows:

import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.*;
import org.apache.kafka.streams.kstream.*;
import java.time.Duration;
import java.util.Properties;

public class KafkaStreamProcessor {

    public static void main(String[] args) {
        Properties props = new Properties();

        // Application ID (required) and Kafka cluster address
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "log-count-app");
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");

        // Default key and value serdes
        props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass().getName());
        props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass().getName());

        StreamsBuilder builder = new StreamsBuilder();
        KStream<String, String> messages = builder.stream("logs");

        // Count the number of log records per key in 5-minute windows
        KTable<Windowed<String>, Long> logsCount = messages
            .groupByKey()
            .windowedBy(TimeWindows.of(Duration.ofMinutes(5)))
            .count();

        logsCount.toStream().foreach((key, value) -> System.out.println(key.toString() + " -> " + value));

        KafkaStreams streams = new KafkaStreams(builder.build(), props);
        streams.start();
    }
}

7. Optimization and tuning

Performance indicator optimization

The performance of a Kafka cluster is affected by many factors. To improve it, pay attention to the following important performance indicators:

  • Message throughput: Refers to the number of messages that a Kafka cluster can process per second, depending on factors such as hardware configuration, network and disk speed, message size, and complexity.
  • Latency: Refers to the time interval between a message being sent from the producer and being received by the consumer, which mainly depends on network delay and disk I/O performance.
  • Disk utilization: refers to the disk space usage of the Kafka cluster. If disk usage is too high, performance may degrade, and a full disk can bring brokers down.
  • Network bandwidth: refers to the network transmission speed between Kafka cluster nodes. If the bandwidth is insufficient, message throughput and delay may be limited.

Parameter configuration optimization

The performance of a Kafka cluster is affected by multiple parameters. To optimize it, the following key parameters need to be considered (a producer-side example follows this list):

  • Number of partitions: The number of partitions is critical to the performance of the Kafka cluster, which determines the ability to process messages in parallel. It is critical to strike a balance between parallel processing and distributed storage.
  • Replication factor: Kafka uses replication to ensure data reliability. Increasing the replication factor improves fault tolerance, but also increases network load and disk usage. Choose it according to how critical the data is and the capacity of the cluster.
  • Batch size: Sending and receiving messages in batches is an important way to optimize Kafka throughput. Larger batch sizes reduce the number of network transfers and I/O operations, thereby increasing throughput. At the same time, a larger batch size will also increase the delay of the message, and a trade-off needs to be made.
  • Maximum connections: each client connection consumes broker resources such as file descriptors and memory, so connection limits are important for Kafka cluster performance. Too many connections can exhaust server resources and degrade performance.
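
For instance, on the producer side, batching and compression are the usual levers for throughput; the values below are illustrative starting points rather than recommendations:

import org.apache.kafka.clients.producer.ProducerConfig;
import java.util.Properties;

Properties props = new Properties();
props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
// Larger batches mean fewer requests, but slightly higher latency
props.put(ProducerConfig.BATCH_SIZE_CONFIG, 65536);
// Wait up to 10 ms to fill a batch before sending
props.put(ProducerConfig.LINGER_MS_CONFIG, 10);
// Compress batches to reduce network and disk usage
props.put(ProducerConfig.COMPRESSION_TYPE_CONFIG, "lz4");
// acks=1 trades some durability for lower latency; use "all" for stronger guarantees
props.put(ProducerConfig.ACKS_CONFIG, "1");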

Architecture design optimization

In order to further improve the performance and reliability of the Kafka cluster, it is necessary to optimize the system architecture of the cluster. The following are some commonly used system architecture optimization methods:

  • Add a caching layer: store frequently accessed data in memory to reduce I/O load and speed up data access.
  • Use data compression: enabling message compression in the Kafka cluster can greatly reduce network transmission and disk writes.
  • Vertical and horizontal scaling: expand the Kafka cluster by adding resources to existing machines or by adding nodes, improving performance and fault tolerance.
  • Geo-replication: distribute multiple Kafka clusters across different geographical locations and replicate data between them to achieve redundancy and improve availability.
