Exploring the underlying principles of Kafka

1. Introduction

  1. Kafka is a distributed stream-processing platform originally developed at LinkedIn and open-sourced under the Apache license.
  2. Kafka is mainly used to process real-time data streams: it can publish, subscribe to, store, and process streams of records.

Application scenario:

  1. Log collection: used in distributed log systems, such as ELK.
  2. Message system: Kafka can be used as a message queue.
  3. Stream Processing: Use Kafka with a stream processing engine like Flink or Spark.

2. Architecture introduction

1. Components

  • Producer: Send data to the Kafka cluster.
  • Consumer: Consume data from the Kafka cluster.
  • Broker: Each server in the Kafka cluster is called a Broker.
  • Topic: A logical category of messages; physically, a Topic is divided into one or more Partitions.
  • Partition: A physical concept. Each Partition corresponds to a directory on disk in which all of that Partition's messages are stored.
  • Offset: Kafka stores messages as a distributed commit log. As a consumer reads data, it records how far it has consumed in each Partition; that position is the Offset.
  • ZooKeeper: Kafka uses ZooKeeper to store the cluster's configuration information and node state, such as Broker registration (and, in older client versions, Consumer offsets).

2. Cluster

  • A Kafka cluster consists of multiple Brokers, and each Broker has a unique number in the cluster.
  • A Broker can accommodate multiple Partitions, and different Partitions of the same Topic are distributed to different Brokers to form a distributed cluster.
  • When a Topic is created, Kafka spreads its Partitions (and their replicas) evenly across the Brokers; the number of Partitions is set per Topic and can only be increased manually afterwards (see the AdminClient sketch below).
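
To make the Topic/Partition/Broker relationship concrete, here is a minimal sketch that creates a topic with the Java AdminClient. The broker address, topic name, and the choice of 3 partitions with replication factor 2 are illustrative assumptions (a replication factor of 2 needs at least two Brokers):

import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.NewTopic;

import java.util.Collections;
import java.util.Properties;

public class TopicSetupDemo {

    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");

        try (AdminClient admin = AdminClient.create(props)) {
            // 3 partitions spread across the Brokers, each kept on 2 Brokers
            NewTopic topic = new NewTopic("demo-topic", 3, (short) 2);
            admin.createTopics(Collections.singletonList(topic)).all().get();
        }
    }
}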

3. Data storage structure

  • Kafka messages are stored in Partitions. Each Partition corresponds to a directory on disk containing multiple Segment files; a new Segment is rolled when the active one reaches a configured size or age. Splitting a Partition into Segments lets Kafka rely on sequential, batched file-system reads and writes, and old data can be expired cheaply by deleting whole Segment files. An illustrative on-disk layout is shown below.
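
For illustration only (segment file names are the base offsets of the segments; the offsets shown here are made up), a partition directory typically looks like this:

my-topic-0/                        # directory for partition 0 of topic "my-topic"
    00000000000000000000.index     # offset index of the first segment
    00000000000000000000.log       # first segment, base offset 0
    00000000000000368769.index
    00000000000000368769.log       # next segment, base offset 368769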

Code example:

import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

import java.util.Properties;
import java.util.concurrent.ExecutionException;

public class KafkaDemo {

    public static void main(String[] args) {

        // 1. Create the producer
        KafkaProducer<String, String> producer = new KafkaProducer<>(getProperties());

        // 2. Create the message
        ProducerRecord<String, String> record = new ProducerRecord<>("myTopic", "key", "value");

        try {
            // 3. Send the message (get() blocks until the broker acknowledges it)
            producer.send(record).get();
            System.out.println("Sent message successfully");
        } catch (InterruptedException | ExecutionException e) {
            e.printStackTrace();
        } finally {
            // 4. Close the connection
            producer.close();
        }
    }

    /**
     * Builds the Kafka producer configuration.
     *
     * @return producer properties
     */
    private static Properties getProperties() {
        Properties props = new Properties();

        // Kafka broker address
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");

        // Serializers for the message key and value
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());

        return props;
    }
}

3. The principle of Kafka message delivery

1. Message producer

Kafka producers send data to the Kafka cluster in the form of messages. A producer sends messages to a specified topic and can optionally specify the target partition. When a producer needs to send a message, it first establishes a TCP connection to a Broker in the Kafka cluster and then transmits the message to that Broker.

import org.apache.kafka.clients.producer.*;

import java.util.Properties;

public class KafkaProducerExample {

    public static void main(String[] args) throws InterruptedException {

        // Configure the Kafka producer
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("acks", "all");
        props.put("retries", 0);
        props.put("batch.size", 16384);
        props.put("linger.ms", 1);
        props.put("buffer.memory", 33554432);
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

        // Create the producer instance
        Producer<String, String> producer = new KafkaProducer<>(props);

        // Send messages to the target topic
        for (int i = 0; i < 10; i++) {
            producer.send(new ProducerRecord<>("my-topic", Integer.toString(i), "Hello World-" + i));
            Thread.sleep(1000); // send one message per second
        }

        producer.close(); // close the producer instance
    }
}

2. Message consumers

Kafka consumers consume messages from one or more partitions in the Kafka cluster. A consumer can subscribe to one or more topics, or be manually assigned specific partitions within a topic.

import org.apache.kafka.clients.consumer.*;

import java.time.Duration;
import java.util.Collections;
import java.util.Properties;

public class KafkaConsumerExample {

    public static void main(String[] args) {

        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("group.id", "my-group"); // consumers sharing this group.id split the topic's partitions
        props.put("enable.auto.commit", "true"); // commit offsets automatically
        props.put("auto.commit.interval.ms", "1000"); // auto-commit interval
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

        // Create the consumer instance and subscribe to the topic
        Consumer<String, String> consumer = new KafkaConsumer<>(props);
        consumer.subscribe(Collections.singletonList("my-topic"));

        while (true) {
            ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(100));
            for (ConsumerRecord<String, String> record : records) {
                // Process the message
                System.out.printf("offset=%d, key=%s, value=%s%n", record.offset(), record.key(), record.value());
            }
        }

        // consumer.close();
    }
}

3. Topics and partitions

Kafka's topic is the unit Kafka uses to distinguish categories of messages. Each topic consists of one or more partitions, which are data containers stored on different nodes of the Kafka cluster, so a topic's messages can be spread across partitions and therefore across Brokers. A client can also address an individual partition directly, as the fragment below illustrates.
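
As a fragment (it reuses producer and consumer instances configured as in the earlier examples; the topic name and partition number 0 are arbitrary, and it needs org.apache.kafka.common.TopicPartition and java.util.Collections):

// Producer side: the four-argument constructor names the target partition explicitly
ProducerRecord<String, String> record = new ProducerRecord<>("my-topic", 0, "key", "value");
producer.send(record);

// Consumer side: assign() bypasses group subscription and reads only partition 0
consumer.assign(Collections.singletonList(new TopicPartition("my-topic", 0)));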

4. Replication mechanism

Kafka's replication mechanism exists to ensure high availability of messages and durability of data. When a message is written to a partition, it is replicated to multiple replicas. Each partition has one or more replicas, exactly one of which is marked as the "leader replica" and is responsible for all reads and writes for that partition. The other replicas are called "follower replicas": they only replicate the leader's data, and in doing so improve the reliability and fault tolerance of the system.
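
The leader/follower layout of a topic can be inspected with the AdminClient. In the sketch below the broker address and topic name are assumptions; it prints the leader, the full replica list, and the in-sync replicas (ISR) of each partition:

import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.TopicDescription;
import org.apache.kafka.common.TopicPartitionInfo;

import java.util.Collections;
import java.util.Properties;

public class ReplicaInspector {

    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");

        try (AdminClient admin = AdminClient.create(props)) {
            TopicDescription desc = admin.describeTopics(Collections.singletonList("my-topic"))
                    .all().get().get("my-topic");
            for (TopicPartitionInfo p : desc.partitions()) {
                // leader() serves reads and writes; replicas() and isr() list the copies
                System.out.printf("partition=%d leader=%s replicas=%s isr=%s%n",
                        p.partition(), p.leader(), p.replicas(), p.isr());
            }
        }
    }
}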

4. Message passing process

1. Message sending process

The message sender hands the message to a Kafka Producer for the target topic; the Producer assigns it to a partition and writes it to that partition on a Broker. Before sending, the Producer obtains cluster metadata from the Brokers listed in bootstrap.servers, including the Broker list and the topic's partition information. The specific process is as follows:

  1. The message sender sends the message to the specified topic through the Producer API.

    String topic = "test_topic";
    String message = "Hello, Kafka!";
    ProducerRecord<String, String> record = new ProducerRecord<>(topic, message);
    producer.send(record);
    
  2. Based on the key of the message, the Producer uses a Partitioner to route messages with the same key to the same partition, which preserves per-key ordering.

    public int partition(String topic, Object key, byte[] keyBytes, Object value, byte[] valueBytes,
                Cluster cluster) {
        int numPartitions = cluster.partitionCountForTopic(topic);
        if (keyBytes == null) {
            // no key: pick a random partition (the built-in partitioner uses round-robin/sticky logic instead)
            Random random = new Random();
            return random.nextInt(numPartitions);
        }
        // hash the key and map it to a partition; toPositive() guards against negative hash values
        int hash = Utils.toPositive(Utils.murmur2(keyBytes));
        return hash % numPartitions;
    }
    
  3. The Producer appends records to an in-memory buffer; a background sender thread ships the buffered records to the Broker in batches once a batch fills up (batch.size) or the linger time (linger.ms) expires, as sketched below.
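
The buffering behavior can be made visible with a small sketch. The broker address, topic name, and tuning values below are illustrative; send() returns immediately after appending to the buffer, and the background sender thread flushes batches to the Broker:

import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

import java.util.Properties;

public class BufferedSendDemo {

    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        props.put(ProducerConfig.BATCH_SIZE_CONFIG, 32768);        // flush a batch once 32 KB accumulates...
        props.put(ProducerConfig.LINGER_MS_CONFIG, 10);            // ...or after waiting at most 10 ms
        props.put(ProducerConfig.BUFFER_MEMORY_CONFIG, 33554432);  // total memory for not-yet-sent records

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            for (int i = 0; i < 1000; i++) {
                // send() only appends to the buffer; the sender thread ships batches in the background
                producer.send(new ProducerRecord<>("test_topic", "key-" + i, "value-" + i),
                        (metadata, exception) -> {
                            if (exception != null) {
                                exception.printStackTrace();
                            }
                        });
            }
        } // close() flushes any records still sitting in the buffer
    }
}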

2. Message storage process

Messages are stored in one or more partitions on the Broker. Within a partition each message has a unique offset, and messages are kept in the order they were appended (each record also carries a timestamp). When a Consumer reads a partition it fetches messages by offset, which preserves ordering within that partition; the sketch below shows how a consumer can position itself at an explicit offset.
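
In this sketch the group id, topic, partition, and the starting offset 42 are all made-up values; a consumer is assigned one partition and then seeks to the offset it wants to read from:

import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.TopicPartition;
import org.apache.kafka.common.serialization.StringDeserializer;

import java.time.Duration;
import java.util.Collections;
import java.util.Properties;

public class SeekByOffsetDemo {

    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "seek-demo");
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            TopicPartition partition = new TopicPartition("test_topic", 0);
            consumer.assign(Collections.singletonList(partition));
            consumer.seek(partition, 42L); // jump straight to offset 42

            ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));
            for (ConsumerRecord<String, String> record : records) {
                System.out.printf("offset=%d value=%s%n", record.offset(), record.value());
            }
        }
    }
}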

Messages stored on the Broker are encoded in an efficient, compact format called a RecordBatch: a Producer groups multiple records destined for the same partition into one batch so that they can be compressed together and transferred to the Broker with little overhead.

3. Message consumption process

A Consumer subscribes to and reads the messages of a specific topic: it pulls the messages of specific partitions from the Broker and processes them. The specific process is as follows:

  1. Consumers send Fetch requests to the Kafka cluster to obtain data.

  2. After the Broker receives the Fetch request, it reads messages from the specified partition starting at the requested offset and returns the data to the Consumer.

  3. After the Consumer receives the data, it processes it, records the next offset to pull for each Partition, and commits that offset periodically (modern clients commit to Kafka's internal __consumer_offsets topic; old clients used ZooKeeper) so that, after a failure or restart, it does not re-read messages it has already processed.

import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;

import java.time.Duration;
import java.util.Arrays;
import java.util.Properties;

public class ManualCommitConsumer {

    public static void main(String[] args) {
        String topicName = "test_topic";
        String groupId = "test_group";
        Properties props = new Properties();
        props.setProperty("bootstrap.servers", "localhost:9092");
        props.setProperty("group.id", groupId);
        props.setProperty("enable.auto.commit", "false"); // commit offsets manually
        props.setProperty("auto.offset.reset", "earliest");
        props.setProperty("key.deserializer", StringDeserializer.class.getName());
        props.setProperty("value.deserializer", StringDeserializer.class.getName());
        KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props);
        consumer.subscribe(Arrays.asList(topicName));
        while (true) {
            ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(1000));
            for (ConsumerRecord<String, String> record : records) {
                System.out.printf("offset = %d, key = %s, value = %s%n",
                  record.offset(), record.key(), record.value());
            }
            consumer.commitSync(); // commit the offsets of the records just processed
        }
    }
}

5. Performance optimization

1. Hardware optimization

Disk

  • When using Kafka, it is recommended to use SSD disks, because the I/O performance of SSDs is better than that of HDD disks.
  • In addition, multiple disks can be used to distribute Kafka data to different disks to reduce the burden on a single disk.

Memory

  • Allocate enough memory to the Kafka Broker process so that it can cache more messages.
  • When the Kafka Producer writes messages, compression can be enabled to reduce the amount of data transmitted and stored.

CPU

  • Kafka Broker usually does not have high CPU requirements, but you still need to pay attention to CPU usage under high load.
  • On a multi-core CPU machine, CPU resources can be fully utilized by increasing the number of Broker instances or increasing the number of partitions.

2. Kafka configuration optimization

Producer side

  • acks parameter: Sets the message acknowledgment level. acks=0 means the producer does not wait for any acknowledgment; acks=1 means only the partition leader has to acknowledge the write; acks=all means all in-sync replicas have to acknowledge it. Stronger acknowledgment levels cost more time per message but provide better data safety.
  • batch.size parameter: Sets the batch size. Smaller batches reduce latency but increase per-request overhead; adjust this parameter to match the workload.
  • compression.type parameter: Sets the compression codec. Options include none (the default), gzip, snappy, lz4, and, in newer versions, zstd. Compressing batches on the producer reduces the amount of data transmitted and improves transmission efficiency. The fragment after this list shows the three settings together.
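
A fragment combining the three settings (imports and the surrounding class are omitted, matching the style of the snippets above; the values are examples to tune, not recommendations):

Properties props = new Properties();
props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
props.put(ProducerConfig.ACKS_CONFIG, "all");               // wait for all in-sync replicas: safest, slowest
props.put(ProducerConfig.BATCH_SIZE_CONFIG, 32768);         // 32 KB batches; weigh against latency needs
props.put(ProducerConfig.COMPRESSION_TYPE_CONFIG, "lz4");   // compress each batch before sending
KafkaProducer<String, String> producer = new KafkaProducer<>(props);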

Broker side

  • message.max.bytes parameter: Sets the maximum size of a single message (batch). If a message exceeds this limit, the Broker rejects it.
  • num.io.threads parameter: Sets the number of threads the Broker uses to handle I/O requests. Increasing this value improves the Broker's concurrency but also raises CPU usage. Both settings live in the Broker's server.properties file, as shown below.
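
For reference, the two parameters are set in server.properties on each Broker; the values here are purely illustrative:

# server.properties (illustrative values)
message.max.bytes=1048588
num.io.threads=8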

3. Consumer optimization

  • Group ID: A consumer group is a logical grouping of consumers in Kafka. Within one group, each partition is consumed by exactly one of the consumers, so setting the Group ID sensibly lets consumption be spread across instances.
  • Fetch size: Controls how much data is pulled from the Broker per request (for example fetch.min.bytes and max.partition.fetch.bytes). A very large fetch size can add latency while the Broker waits to accumulate data, and a very small one increases the number of network round trips; tune it to balance the two.
  • Processing model: The Kafka consumer is pull-based: the application repeatedly calls poll() to fetch records. There is no push-style API in the standard consumer client; push-like behavior has to be layered on top, for example with a background polling thread or a higher-level framework. The pull model lets the application control its own consumption rate and suits most scenarios. A configuration sketch touching these knobs follows the list.
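
A consumer configuration fragment (imports and the surrounding class are omitted; the group id and numeric values are illustrative):

Properties props = new Properties();
props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
props.put(ConsumerConfig.GROUP_ID_CONFIG, "analytics-group");          // consumers sharing this id split the partitions
props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
props.put(ConsumerConfig.FETCH_MIN_BYTES_CONFIG, 1024);                // Broker waits until at least 1 KB is ready
props.put(ConsumerConfig.MAX_PARTITION_FETCH_BYTES_CONFIG, 1048576);   // at most 1 MB per partition per fetch
props.put(ConsumerConfig.MAX_POLL_RECORDS_CONFIG, 500);                // cap the records returned by one poll()
KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props);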

4. Code example

The following is a Java code example for batch writing messages on the Kafka Producer side:

import org.apache.kafka.clients.producer.*;

import java.util.Properties;

public class KafkaProducerSample {

    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("batch.size", 16384); // accumulate up to 16 KB per partition before sending

        Producer<String, String> producer = new KafkaProducer<>(props);
        String topic = "myTopic";
        for (int i = 0; i < 10000; i++) {
            String msg = "My message No." + i;
            ProducerRecord<String, String> record = new ProducerRecord<>(topic, msg);
            producer.send(record); // asynchronous: records are buffered and sent in batches
        }
        producer.close();
    }
}

6. Advantages and disadvantages of Kafka

1. Advantages of Kafka

  • High throughput and low latency: Kafka implements load balancing through the concept of partition and consumer group, supports distributed deployment, and can achieve high throughput and low latency.
  • High scalability: All nodes in the Kafka cluster are equal, and new nodes can be easily added to the cluster to expand the capacity of the cluster without interrupting the running services.
  • Persistent storage: Kafka data is stored on the disk in the form of files, with high reliability. Even if some nodes fail, the data will not be lost. It is very suitable for continuous storage of large-scale data and offline analysis and processing.
  • High reliability: Kafka supports data backup and replication; replicas improve stability and help ensure that data is not lost.
  • Broad integration: Kafka has client libraries for many programming languages and can be integrated with different types of applications through its ecosystem, for example HTTP REST proxies, connectors, and other complementary tools.

2. Disadvantages of Kafka

  • Deployment and configuration are relatively complex: a Kafka cluster has to be configured and deployed, which requires a certain amount of operational expertise; for smaller organizations this can take considerable time and effort.
  • Data must be processed elsewhere: Kafka is only a message delivery platform and does not process data directly. Users write their own code for data processing, so it mainly suits teams with development capability.
  • No fully automatic management: a Kafka cluster needs manual operation; for example, when a node fails, partition load has to be rebalanced by hand.

7. Application cases of Kafka

Kafka is an open source distributed message system, which has a wide range of applications in the field of big data. Three application cases of Kafka are introduced below.

1. Web crawlers

The core function of a web crawler is to fetch data from the Internet and analyze or save it. Kafka can act as the message queue of a web crawler, buffering the crawled data between the crawler and downstream programs: after the crawler finishes fetching and parsing, it sends the data to Kafka for subsequent processing programs to consume.

In a concrete implementation, you first need to create a Kafka topic named spider, and then write producer code in the crawler program that sends the crawled data to that topic. Here is the Java sample code:

import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import java.util.Properties;

public class SpiderProducer {

    public static void main(String[] args) {

        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("acks", "all");
        props.put("retries", 0);
        props.put("batch.size", 16384);
        props.put("linger.ms", 1);
        props.put("buffer.memory", 33554432);
        props.put("key.serializer",
            "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer",
            "org.apache.kafka.common.serialization.StringSerializer");

        KafkaProducer<String, String> producer =
            new KafkaProducer<String, String>(props);

        for(int i = 0; i < 100; i++)
            producer.send(new ProducerRecord<String, String>(
                "spider", Integer.toString(i), "data-" + Integer.toString(i)));

        producer.close();
    }
}

2. Statistics

In addition to being used as a message queue, Kafka can also serve as a data buffer and handle large volumes of data streams. In a statistics pipeline, Kafka is used in both directions: producers send the collected data to a topic, and consumers read the data back from the topic to perform analysis and statistics. In a concrete implementation, you first need to create a Kafka topic named data, then write producer code in the data-collection program to send data to that topic, and finally write consumer code in the processing program that reads from the topic and computes the statistics.

Here is the Java sample code:

import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

import java.time.Duration;
import java.util.Arrays;
import java.util.Properties;

public class DataConsumer {

    public static void main(String[] args) {

        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("group.id", "test");
        props.put("enable.auto.commit", "true");
        props.put("auto.commit.interval.ms", "1000");
        props.put("key.deserializer",
            "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer",
            "org.apache.kafka.common.serialization.StringDeserializer");

        KafkaConsumer<String, String> consumer =
            new KafkaConsumer<String, String>(props);

        consumer.subscribe(Arrays.asList("data"));

        while (true) {
            ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(100));
            for (ConsumerRecord<String, String> record : records)
                System.out.printf("offset = %d, key = %s, value = %s%n",
                    record.offset(), record.key(), record.value());
        }
    }
}

import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import java.util.Properties;

public class DataProducer {

    public static void main(String[] args) {

        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("acks", "all");
        props.put("retries", 0);
        props.put("batch.size", 16384);
        props.put("linger.ms", 1);
        props.put("buffer.memory", 33554432);
        props.put("key.serializer",
            "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer",
            "org.apache.kafka.common.serialization.StringSerializer");

        KafkaProducer<String, String> producer =
            new KafkaProducer<String, String>(props);

        for(int i = 0; i < 100; i++)
            producer.send(new ProducerRecord<String, String>(
                "data", Integer.toString(i), "data-" + Integer.toString(i)));

        producer.close();
    }
}

3. Real-time monitoring

Kafka can be used as the transport medium in real-time monitoring, delivering source data streams to consumers in a distributed fashion. In a concrete implementation, you first need to create a Kafka topic named metrics, send the monitoring data to that topic from the producer program, and then write consumer code in the monitoring center that reads data from the topic and performs analysis and monitoring.

Here is the Java sample code:

import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

import java.time.Duration;
import java.util.Arrays;
import java.util.Properties;

public class MetricsConsumer {

    public static void main(String[] args) {

        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("group.id", "test");
        props.put("enable.auto.commit", "true");
        props.put("auto.commit.interval.ms", "1000");
        props.put("key.deserializer",
            "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer",
            "org.apache.kafka.common.serialization.StringDeserializer");

        KafkaConsumer<String, String> consumer =
            new KafkaConsumer<String, String>(props);

        consumer.subscribe(Arrays.asList("metrics"));

        while (true) {
            ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(100));
            for (ConsumerRecord<String, String> record : records)
                System.out.printf("offset = %d, key = %s, value = %s%n",
                    record.offset(), record.key(), record.value());
        }
    }
}

import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import java.util.Properties;

public class MetricsProducer {

    public static void main(String[] args) {

        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("acks", "all");
        props.put("retries", 0);
        props.put("batch.size", 16384);
        props.put("linger.ms", 1);
        props.put("buffer.memory", 33554432);
        props.put("key.serializer",
            "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer",
            "org.apache.kafka.common.serialization.StringSerializer");

        KafkaProducer<String, String> producer =
            new KafkaProducer<String, String>(props);

        for(int i = 0; i < 100; i++)
            producer.send(new ProducerRecord<String, String>(
                "metrics", Integer.toString(i), "metrics-" + Integer.toString(i)));

        producer.close();
    }
}
